Loading data from an HDFS file into a Hive table using DataStage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
rsmohankumar
Participant
Posts: 1
Joined: Mon Mar 04, 2013 8:59 am

Loading data from an HDFS file into a Hive table using DataStage

Post by rsmohankumar »

Hi all,

We are loading data from a CSV file in HDFS (accessed via the Big Data File stage) into a Hive table using the JDBC stage in DataStage 11.5. The loading performance is very poor: it takes 22 seconds to insert one record into the Hive table. Can you please let us know what we can do to improve the performance of loading through the JDBC stage?

We suspect the data is being inserted into the Hive table one row at a time, even though we set 2000 rows per transaction.

Thanks in advance.
Thanks,
Mohan
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Welcome!

Rows per transaction just tells it when to commit. If you have an 'Array Size' property there, that is what controls how many rows are sent to the target at a time.
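
To make the distinction concrete, here is a rough Java sketch of the two knobs at the plain JDBC level; the endpoint, table, and sizes are hypothetical, and whether the Hive JDBC driver actually honors batching and manual commits depends on the driver version:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchVsCommitSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and table.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "etl_user", "");
        conn.setAutoCommit(false);

        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO my_table VALUES (?, ?)");

        final int arraySize = 500;   // 'Array Size': rows per round trip to the server
        final int txnSize = 2000;    // 'Rows per transaction': rows per commit

        for (int i = 1; i <= 10000; i++) {
            ps.setInt(1, i);
            ps.setString(2, "row-" + i);
            ps.addBatch();                 // buffered client-side, nothing sent yet
            if (i % arraySize == 0) {
                ps.executeBatch();         // one round trip carries the whole array
            }
            if (i % txnSize == 0) {
                conn.commit();             // commit point; independent of the array size
            }
        }
        ps.executeBatch();                 // flush any remainder
        conn.commit();
        ps.close();
        conn.close();
    }
}
```

Note also that on older Hive versions each INSERT statement can launch its own MapReduce job, which by itself would explain seconds per row regardless of these settings.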
-craig

"You can never have too many knives" -- Logan Nine Fingers
Timato
Participant
Posts: 24
Joined: Tue Sep 30, 2014 10:51 pm

Post by Timato »

Which distribution of Hadoop are you using? From what I can gather, the Big Data File stage is primarily aimed at IBM's BigInsights, and I'd imagine there may be issues when interacting with other distributions.

Have you tried using the File Connector stage instead? WebHDFS/HttpFS is standard with most HDFS distributions, I think?
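
For reference, WebHDFS writes are plain HTTP, which is why that route works across distributions. A minimal two-step sketch in Java, with hypothetical host, port, path, and user (the NameNode answers the first PUT with a redirect to a DataNode):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WebHdfsCreateSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: ask the NameNode where to write (hypothetical host/path/user).
        URL nn = new URL("http://namenode:50070/webhdfs/v1/tmp/sample.csv"
                + "?op=CREATE&overwrite=true&user.name=hdfs");
        HttpURLConnection c1 = (HttpURLConnection) nn.openConnection();
        c1.setRequestMethod("PUT");
        c1.setInstanceFollowRedirects(false);   // WebHDFS replies with a 307 redirect
        String dataNodeUrl = c1.getHeaderField("Location");
        c1.disconnect();

        // Step 2: send the file content to the DataNode URL from the redirect.
        HttpURLConnection c2 = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        c2.setRequestMethod("PUT");
        c2.setDoOutput(true);
        try (OutputStream out = c2.getOutputStream()) {
            out.write("1,first row\n2,second row\n".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP status: " + c2.getResponseCode());  // 201 on success
        c2.disconnect();
    }
}
```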
TNZL_BI
Premium Member
Posts: 24
Joined: Mon Aug 20, 2012 5:15 am
Location: NZ

Post by TNZL_BI »

Hi, I have been facing a similar issue. I am using the Hive Connector stage to load and extract data.

However, the speed is dismal. Is there something we can do to improve the performance of loading into Hive? That said, I don't expect Hive loading to be as fast as a conventional database: Hive presents a database-like interface, but it is not a database in the typical sense, since beneath the surface queries run as Java MapReduce jobs.

Nevertheless, do we know of some ways to get this tuned? I see an array size property in the ODBC stage but not in the native Hive Connector stage.

Any info on fine-tuning performance here would be really helpful.
TNZL_BI
Premium Member
Posts: 24
Joined: Mon Aug 20, 2012 5:15 am
Location: NZ

Post by TNZL_BI »

IBM has suggested that I apply some patches. I will install them and then post an update.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

I have talked with other customers who use the File Connector exclusively for loading, writing directly to the HDFS file that Hive is abstracting, precisely for performance reasons.
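
To illustrate the idea (a sketch with the Hadoop client API, not what the File Connector does internally; the NameNode address and warehouse path are hypothetical): a delimited file written into the directory backing a Hive table is visible to the table immediately.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDirectWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address -- in practice taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical warehouse path backing the Hive table.
        Path target = new Path("/user/hive/warehouse/my_table/part-0001.csv");

        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                fs.create(target, true), StandardCharsets.UTF_8))) {
            // Rows must match the table's delimiter and column order.
            out.write("1,first row\n");
            out.write("2,second row\n");
        }
        fs.close();
    }
}
```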

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... here/)
TNZL_BI
Premium Member
Posts: 24
Joined: Mon Aug 20, 2012 5:15 am
Location: NZ

Post by TNZL_BI »

Exactly. I have been using the File Connector stage now, and it's a better, faster way to put data onto Hadoop than the Hive or ODBC connector stages.
The other advantage is that the File Connector stage also provides an option to create the Hive table, which is like two steps in one.
dsuser_cai
Premium Member
Posts: 151
Joined: Fri Feb 13, 2009 4:19 pm

Post by dsuser_cai »

We use the Big Data File stage in a job to load data to HDFS and then use a script to create the Hive table with the correct partitions. We store data in a /folder/structure/for_Hive/tableName/yyyy/mm/dd folder layout, and the Hive tables are partitioned on year, month, and day. Both the HDFS load and the Hive table creation are executed from a job sequence. A sketch of what the partition-registration step can look like follows below.
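
For anyone wiring up the same pattern, the script step usually boils down to registering each dated folder as a partition. A hedged sketch of that step using the Hive JDBC driver, with the database, table, path, and date values treated as placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddHivePartitionSketch {
    public static void main(String[] args) throws Exception {
        // Older driver versions may need explicit registration.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "etl_user", "");
        try (Statement stmt = conn.createStatement()) {
            // Register the dated folder the DataStage job just wrote as a partition.
            stmt.execute(
                "ALTER TABLE tableName ADD IF NOT EXISTS "
              + "PARTITION (year='2015', month='06', day='01') "
              + "LOCATION '/folder/structure/for_Hive/tableName/2015/06/01'");
        }
        conn.close();
    }
}
```

The table itself is created once as an EXTERNAL table PARTITIONED BY (year, month, day), so adding a partition afterwards is just a metastore operation and is cheap.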
Thanks
Karthick
Post Reply