Performance issue when reading a .csv file using the Sequential File stage

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Performance issue when reading a .csv file using the Sequential File stage

Post by srividya »

Hi All

I have a performance issue while reading a .csv file which has 10,254,768 rows of data in it. :shock:

The job flow is as below:

Seq File > Transformer > Sort > Transformer > Oracle stage (2 Oracle stages: one for capturing reject data and the other for good data)

Our process runs on two nodes only. :(

As I was not able to pinpoint the issue, I split the job into two, assuming the performance hindrance is in reading the file.

It was taking 5 minutes to read 10% of the file, which makes the job run for almost 2 hours depending on the server load. The only hiccup is reading the HUGE volume of data; the data itself is upserted into the DB within 15 minutes, which I guess is acceptable for now.

Is there any way I can improve the performance?

We have started to receive files of this size over the past five days, and I am not sure if we will have the same volume in the future as well. I will be following up with the source on why the volume has spiked by 50% all of a sudden!

Any help on this issue is appreciated.

PS: I tried searching the forum, but I quit after scanning 50-60 pages. :roll:
Sorry if I have overlooked anything. :?
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

How did you split the job? What other steps are present in your reading job?
Try another test: read and write to a dataset without any sorts or transformers, and see how much time that job takes.
What is the byte length of the record?
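As a rough illustration of why the record length matters: if each record were, say, 500 bytes (purely an assumption, not a figure from this job), 10,254,768 rows would be roughly 5 GB of raw data, so even the raw disk read rate puts a floor on how fast that file can be scanned.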
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

As DSGuru said...

You need to find out which stage is actually causing the bottleneck; do that by gradually dropping stages and sending the output to a Copy stage. I wouldn't use the upsert either; we have found it to be incredibly slow for large volumes of data. Use a Change Capture to separate updates from inserts and run them separately.
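For what it's worth, the database-side equivalent of "run them separately" looks roughly like the sketch below. This is only an illustration of the idea, not the Change Capture stage itself; the table and column names (target_tbl, stg_feed, cust_id, amount) are made up, and it assumes the staged feed has one row per key.

Code:

-- Hypothetical: the feed rows have already landed in a staging table stg_feed.
-- Pass 1: insert rows whose key does not yet exist in the target.
INSERT INTO target_tbl (cust_id, amount)
SELECT s.cust_id, s.amount
  FROM stg_feed s
 WHERE NOT EXISTS (SELECT 1 FROM target_tbl t WHERE t.cust_id = s.cust_id);

-- Pass 2: update rows whose key already exists, as a separate statement.
UPDATE target_tbl t
   SET t.amount = (SELECT s.amount FROM stg_feed s WHERE s.cust_id = t.cust_id)
 WHERE EXISTS (SELECT 1 FROM stg_feed s WHERE s.cust_id = t.cust_id);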
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

Thanks for the response. I have just completed the run for loading data from a sequential file to an Oracle table as a plain insert. It took 1 hour 13 minutes to complete. :cry:

This job was a plain load: read from the Sequential File stage and write to the Oracle stage.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:!: Don't blame the sequential file reads for that. 'Plain inserts' are not speedy, and the load can also be affected by the indexes on your target table, extents, system load, etc.
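If you want to rule the indexes in or out, the data dictionary is a quick place to look; a minimal sketch, assuming the target table is called TARGET_TBL (substitute the real name):

Code:

-- List the indexes on the target table and whether any are unusable
SELECT index_name, uniqueness, status
  FROM user_indexes
 WHERE table_name = 'TARGET_TBL';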
-craig

"You can never have too many knives" -- Logan Nine Fingers
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

I agree with that, chulett. The tables are in good shape: properly indexed, manually extended with more storage space, and no other processes are running on the server as of now.

However, there is one interesting thing I have noticed: in the before-job subroutine, a script is being called to truncate the table.

I manually truncated the table and ran the job after removing this script call, and the job completed within 45 minutes.

I am not sure if this is something that is affecting the performance, or if it is just me going crazy. :roll:
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

When the job took 1 hour 13 minutes, how much of that time was spent running the before script? You can tell from the detail in the logs.
-craig

"You can never have too many knives" -- Logan Nine Fingers
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

Hi chulett,

It took approximately 30 minutes to log the message that the before subroutine had finished.

I have just had the table analysed and raised a request for a manual extension; hopefully it will improve the performance.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

And you're certain all it is doing is a truncate? That should take... moments. Perhaps it is doing a transactional delete of the records in the table instead. :?
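The difference matters because a truncate is DDL and simply deallocates the table's extents, while a delete generates undo and redo for every row; a quick sketch of the two, with target_tbl standing in for the real table name:

Code:

-- DDL: resets the table in one operation, no per-row logging; normally takes moments
TRUNCATE TABLE target_tbl;

-- DML: logs undo/redo for every deleted row; on 10M+ rows this can run a very long time
DELETE FROM target_tbl;
COMMIT;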
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Create this job and identify the read time.

Code:

SequentialFile  ----->  Copy
Investigate using two readers per node.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

Yes, it is a simple TRUNCATE TABLE statement. I have a call with the DBA tomorrow morning; probably I will have more information after the call. :D

However much I work on this table, I will still need to redesign the data flow. Bad design :evil: ... I took over the code a couple of days back and it looks messy.
algfr
Participant
Posts: 106
Joined: Fri Sep 09, 2005 7:42 am

Post by algfr »

I think it is more efficient to import the contents of the seq file as a single column and then split it out with a Column Import stage.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

algfr wrote:I think it is more efficient to import the contents of the seq file as a single column and then split it out with a Column Import stage.
That is what happens under the covers whilst reading from a sequential file, but in a combined mode.

Btw, why are two transformers present? Can't you use a single transformer and do both?

Remove the Oracle stage and use a Copy stage to see the performance.
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

It will be a challenge to read such a HUGE set of data in a single column.
The max size I have seen is

I have done more research on the job and observed that the Sort stage was taking an exceptionally long time. I used hash partitioning on the sort keys, and the job performance has improved drastically; it now takes 20-30 minutes.

I have not changed the data flow, so it makes me much more confident. :D

I am debugging the job further and will keep the thread posted with the outcome.
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

Hi All

I tried many ways to optimize the performance in DataStage for loading the 10M rows we are now getting, but was not able to achieve any improvement beyond a point. So I have switched the whole process to SQL*Loader. I was able to apply all the trims and NULL conditions without any problems, and now the data load completes within 30 minutes.
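In case it helps anyone who finds this thread later, a trimmed-down sketch of what such a SQL*Loader control file can look like; the file, table, and column names here are made up for illustration, not the actual ones from this job:

Code:

-- Hypothetical control file, load.ctl
LOAD DATA
INFILE 'feed.csv'
APPEND
INTO TABLE target_tbl
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  cust_id,
  -- trim the incoming field with a SQL expression applied during the load
  cust_name  "TRIM(:cust_name)",
  -- treat an all-blank field as NULL
  amount     NULLIF amount=BLANKS
)

It would then be invoked with something like sqlldr userid=... control=load.ctl; the direct=true option is worth testing too, since direct-path loads are usually much faster than conventional ones.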

Thanks!!