BigIntegrate in Hadoop - Dataset stage vs BDFS stage

prematram
Participant
Posts: 14
Joined: Fri Dec 14, 2007 2:01 am
Location: Chennai


Post by prematram »

Hi,

We are currently using the Dataset stage to create intermediate files in HDFS. The descriptor file is created on the edge node (Linux) and the data files reside on the data nodes (HDFS). My questions are:

1) Is there any I/O performance overhead because the descriptor file is not in HDFS while the data files are?
2) Is the BDFS stage better than the Dataset stage, given that it is a Hadoop-native stage?

I would appreciate your expert advice on this.

Thanks in Advance!
Prem R.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I don't have any relevant experience with your second question, but regarding the first one:

The Dataset descriptor file is a small file that contains only control information and pointers to the actual data files. It can be located anywhere, and its location won't affect dataset read/write performance.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

As Arnd said, the descriptor file is tiny. It just describes where the data parts are, so its location has no impact on performance.

The data portion of the dataset may actually be located in HDFS. That depends on the settings in the $DSHOME/yarn.config file. In it there's the following section:

Code:

APT_YARN_USE_HDFS=true
# By default with dynamic configuration files YARN will use HDFS for datasets, filesets
# and lookup tables.  With static configuration files YARN will use local disk for these
# files.  This setting can be used to override these defaults.
# It accepts a value of true or false.

If it is set to true and the APT config file is dynamic, then the Resource Disk paths in the APT configuration file reference an HDFS path, not a Linux path.

That in effect gives you the benefits of a partitioned dataset that is in the HDFS file system.
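
To make the default-versus-override behaviour described in the yarn.config comments a bit more concrete, here is a minimal Python sketch. It is only an illustration of the rule as the comments state it, not how the parallel engine actually reads the file. The function names `parse_yarn_config` and `datasets_use_hdfs` and the `dynamic_apt_config` flag are hypothetical, and the sample text is just the setting quoted above.

```python
def parse_yarn_config(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def datasets_use_hdfs(settings, dynamic_apt_config):
    """Assumed rule, taken from the yarn.config comments:
    an explicit APT_YARN_USE_HDFS value overrides the default;
    otherwise dynamic config files default to HDFS and static
    config files default to local disk."""
    value = settings.get("APT_YARN_USE_HDFS")
    if value is None:
        return dynamic_apt_config
    return value.lower() == "true"


sample = """
APT_YARN_USE_HDFS=true
# It accepts a value of true or false.
"""
settings = parse_yarn_config(sample)
print(datasets_use_hdfs(settings, dynamic_apt_config=True))
```

With the setting left out entirely, the sketch falls back to the stated defaults: HDFS for a dynamic configuration file, local disk for a static one.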
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020