DSXchange: DataStage and IBM Websphere Data Integration Forum


Posted: Sat Mar 17, 2018 9:34 pm

DataStage® Release: 11x
Job Type: Parallel
OS: Unix
Additional info: related to BigIntegrate installed in Hadoop environment

We are currently using the Dataset stage to create intermediate files in HDFS. The descriptor file is created on the edge node (Linux), and the data files reside on the data nodes (HDFS). My questions are:

1) Is there any I/O performance overhead, given that the descriptor is not in HDFS while the data files are?
2) Is the BDFS stage better than the Dataset stage, since it is a Hadoop-native stage?

Need your expert advice on this.

Thanks in Advance!

Prem R.


Posted: Mon Mar 19, 2018 2:29 am

I haven't any relevant experience on your second question, but regarding the first one: The DataSet descriptor file is a small file which merely contains control information and points to the act ...



Posted: Tue Apr 03, 2018 12:04 pm

As Arndt said, the descriptor file is tiny. It just describes where the data parts are, so it has no impact on performance.

The data portion of the dataset may actually be located in HDFS. That depends on the settings in the $DSHOME/yarn.config file. In it there's the following section:

# By default with dynamic configuration files YARN will use HDFS for datasets, filesets
# and lookup tables.  With static configuration files YARN will use local disk for these
# files.  This setting can be used to override these defaults.
# It accepts a value of true or false.

If it is set to true and the APT config file is dynamic, then the Resource Disk paths in the APT configuration file reference an HDFS path, not a Linux path.

That in effect gives you the benefits of a partitioned dataset that is in the HDFS file system.
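
To make the rule in that comment block concrete, here is a rough sketch in Python of how the default and the override interact. It is not how DataStage itself reads the file. It assumes the file holds simple KEY=VALUE lines, and because the setting's name is not shown above, the key is passed in as a parameter; `EXAMPLE_USE_HDFS` in the usage lines is a made-up placeholder, not the real setting name. The function names are also my own. How an explicit true behaves with a static configuration file is taken from the comment's wording only, so treat that branch as an assumption.

```python
def parse_key_values(config_text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def datasets_use_hdfs(config_text: str, override_key: str,
                      config_is_dynamic: bool) -> bool:
    """Apply the rule described in the comment block.

    Default: dynamic configuration file -> HDFS, static -> local disk.
    An explicit true/false under override_key replaces that default.
    override_key is supplied by the caller (hypothetical name here).
    """
    value = parse_key_values(config_text).get(override_key, "").lower()
    if value == "true":
        return True
    if value == "false":
        return False
    # Setting absent or empty: fall back to the documented default.
    return config_is_dynamic


# Usage with a placeholder key name (not the real setting name).
sample = "# override for datasets, filesets and lookup tables\nEXAMPLE_USE_HDFS=false\n"
print(datasets_use_hdfs(sample, "EXAMPLE_USE_HDFS", config_is_dynamic=True))
print(datasets_use_hdfs("", "EXAMPLE_USE_HDFS", config_is_dynamic=True))
```

The first call reports local disk because the explicit false wins over the dynamic default; the second has no setting, so the dynamic default (HDFS) applies.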

Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2017
All times are GMT - 6 Hours