BigIntegrate in Hadoop - Dataset stage vs BDFS stage

prematram · Post by **prematram** » Sat Mar 17, 2018 9:34 pm

Hi,

We are currently using the Dataset stage to create intermediate files in HDFS. The descriptor file is created on the edge node (Linux) and the data files reside on the data nodes (HDFS). My questions are:

1) Is there any I/O performance overhead because the descriptor file is not in HDFS while the data files are?
2) Is the BDFS stage better than the Dataset stage, given that it is a Hadoop-native stage?

Need your expert advice on this.

Thanks in Advance!

ArndW · Post by **ArndW** » Mon Mar 19, 2018 2:29 am

I haven't any relevant experience on your second question, but regarding the first one:

The DataSet descriptor file is a small file which merely contains control information and points to the actual data files. It can be located anywhere and its location won't affect dataset R/W performance.

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Tue Apr 03, 2018 12:04 pm

Per what Arndt said, the descriptor file is tiny - it just describes where the parts are - no impact on performance.

The data portion of the dataset may actually be located in HDFS. That depends on the settings in the $DSHOME/yarn.config file. In it there's the following section:

APT_YARN_USE_HDFS=true
# By default with dynamic configuration files YARN will use HDFS for datasets, filesets
# and lookup tables.  With static configuration files YARN will use local disk for these
# files.  This setting can be used to override these defaults.
# It accepts a value of true or false.

If it is set to true and the APT config file is dynamic, then the Resource Disk paths in the APT configuration file refer to HDFS paths, not Linux paths.

That in effect gives you the benefits of a partitioned dataset stored in HDFS.
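
If you want to check quickly how a given yarn.config would behave, here is a minimal Python sketch of the rule as the comment block in that file states it. It is not an IBM tool: the function names `parse_settings` and `datasets_use_hdfs` are made up for illustration, and it assumes the file uses plain KEY=value lines with `#` comments and that an explicit APT_YARN_USE_HDFS value overrides the dynamic/static default.

```python
def parse_settings(text):
    """Collect KEY=value pairs, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def datasets_use_hdfs(text, dynamic_config):
    """Return True if datasets would be placed in HDFS.

    Assumption taken from the yarn.config comments: dynamic configuration
    files default to HDFS, static ones default to local disk, and an
    explicit APT_YARN_USE_HDFS setting overrides that default.
    """
    value = parse_settings(text).get("APT_YARN_USE_HDFS")
    if value is None:
        return dynamic_config
    return value.lower() == "true"


sample = """APT_YARN_USE_HDFS=true
# It accepts a value of true or false.
"""
print(datasets_use_hdfs(sample, dynamic_config=True))
```

To run it against a real installation you would read the contents of $DSHOME/yarn.config into `text` yourself; the sketch deliberately does not guess at the file's location or any other settings in it.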