Performance issue when reading a .csv file using the Sequential File stage

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Performance issue when reading a .csv file using the Sequential File stage

Post by srividya »

Hi All

I have a performance issue while reading a .csv file which has 10,254,768 rows of data in it. :shock:

The job flow is as below:

Seq File > Transformer > Sort > Transformer > Oracle stage (2 Oracle stages: one for capturing reject data and the other for good data)

Our process runs on two nodes only. :(

As I was not able to pinpoint the issue, I split the job into two, assuming the performance hindrance is in reading the file.

It was taking 5 minutes to read 10% of the file, which makes the job run for almost 2 hours depending on the server load. The only hiccup is reading the HUGE volume of data; the data itself is upserted into the DB within 15 minutes, which I guess is acceptable for now.

Is there any way I can improve the performance?

We have started to receive files of this size over the past five days, and I am not sure if we will have the same volume in the future as well. I will be following up with the source on why the volume has spiked by 50% all of a sudden!

Any help on this issue is appreciated.

PS: I tried searching the forum, but I quit after scanning 50-60 pages. :roll:
Sorry if I have overlooked anything. :?
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

How did you split the job? What other steps are present in your reading job?
Try another test: read and write to a dataset without any sorts or transformers, and see how much time that job takes.
What is the byte length of the record?
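As a rough illustration of why the record length matters: if each record were, say, 500 bytes (purely an assumption, not a figure from this job), 10,254,768 rows would be roughly 5 GB of raw data, so even the raw disk read rate puts a floor on how fast that file can be scanned.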
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

As DSGuru said...

You need to find out which stage is actually causing the bottleneck; do that by gradually dropping stages and sending the output to a Copy stage. I wouldn't use the upsert either; we have found it to be incredibly slow for large volumes of data. Use a Change Capture to separate updates from inserts and run them separately.
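For what it's worth, the database-side equivalent of "run them separately" looks roughly like the sketch below. This is only an illustration of the idea, not the Change Capture stage itself; the table and column names (target_tbl, stg_feed, cust_id, amount) are made up, and it assumes the staged feed has one row per key.

Code:

-- Hypothetical: the feed rows have already landed in a staging table stg_feed.
-- Pass 1: insert rows whose key does not yet exist in the target.
INSERT INTO target_tbl (cust_id, amount)
SELECT s.cust_id, s.amount
  FROM stg_feed s
 WHERE NOT EXISTS (SELECT 1 FROM target_tbl t WHERE t.cust_id = s.cust_id);

-- Pass 2: update rows whose key already exists, as a separate statement.
UPDATE target_tbl t
   SET t.amount = (SELECT s.amount FROM stg_feed s WHERE s.cust_id = t.cust_id)
 WHERE EXISTS (SELECT 1 FROM stg_feed s WHERE s.cust_id = t.cust_id);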
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

Thanks for the response. I have just completed the run for loading data from a sequential file to an Oracle table as a plain insert. It took 1 hour 13 minutes to complete. :cry:

This job was a plain load: read from the Sequential File stage and write to the Oracle stage.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:!: Don't blame the sequential file reads for that. 'Plain inserts' are not speedy, and the load can also be affected by the indexes on your target table, extents, system load, etc.
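If you want to rule the indexes in or out, the data dictionary is a quick place to look; a minimal sketch, assuming the target table is called TARGET_TBL (substitute the real name):

Code:

-- List the indexes on the target table and whether any are unusable
SELECT index_name, uniqueness, status
  FROM user_indexes
 WHERE table_name = 'TARGET_TBL';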
-craig

"You can never have too many knives" -- Logan Nine Fingers
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

I agree with that, chulett. The tables are in good shape: properly indexed, manually extended with more storage space, and no other processes are running on the server as of now.

However, there is one interesting thing I have noticed: in the before-job subroutine, a script is being called to truncate the table.

I manually truncated the table and ran the job after removing this script call, and the job completed within 45 minutes.

I am not sure if this is something that is affecting the performance, or if it is just me going crazy. :roll:
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

When the job took 1 hour 13 minutes, how much of that time was spent running the before script? You can tell from the detail in the logs.
-craig

"You can never have too many knives" -- Logan Nine Fingers
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

Hi chulett,

It took approximately 30 minutes to log the message that the before subroutine had finished.

I have just had the table analysed and raised a request for a manual extension; hopefully it will improve the performance.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

And you're certain all it is doing is a truncate? That should take... moments. Perhaps it is doing a transactional delete of the records in the table instead. :?
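The difference matters because a truncate is DDL and simply deallocates the table's extents, while a delete generates undo and redo for every row; a quick sketch of the two, with target_tbl standing in for the real table name:

Code:

-- DDL: resets the table in one operation, no per-row logging; normally takes moments
TRUNCATE TABLE target_tbl;

-- DML: logs undo/redo for every deleted row; on 10M+ rows this can run a very long time
DELETE FROM target_tbl;
COMMIT;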
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Create this job and identify the read time.

Code:

SequentialFile  ----->  Copy
Investigate using two readers per node.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

Yes, it is a simple TRUNCATE TABLE statement. I have a call with the DBA tomorrow morning; probably I will have more information after the call. :D

However much I work on this table, I will still need to redesign the data flow. Bad design :evil: ... I took over the code a couple of days back and it looks messy.
algfr
Participant
Posts: 106
Joined: Fri Sep 09, 2005 7:42 am

Post by algfr »

I think it is more efficient to import the contents of the seq file as a single column and then split it out with a Column Import stage.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

algfr wrote:I think it is more efficient to import the contents of the seq file as a single column and then split it out with a Column Import stage.
That is what happens under the covers whilst reading from a sequential file, but in a combined mode.

Btw, why are two transformers present? Can't you use a single transformer and do both?

Remove the Oracle stage and use a Copy stage to see the performance.
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

It will be a challenge to read such a HUGE set of data in a single column.
The max size I have seen is

I have done more research on the job and observed that the Sort stage was taking an exceptionally long time. I used hash partitioning on the sort keys, and the job performance has improved drastically; it now takes 20-30 minutes.

I have not changed the data flow, so it makes me much more confident. :D

I am debugging the job further and will keep the thread posted with the outcome.
srividya
Participant
Posts: 62
Joined: Thu Aug 25, 2005 2:31 am
Location: Ashburn,VA

Post by srividya »

Hi All

I tried many ways to optimize the performance in DataStage for loading the 10M rows we are now getting, but was not able to achieve any improvement beyond a point. So I have switched the whole process to SQL*Loader. I was able to apply all the trims and NULL conditions without any problems, and now the data load completes within 30 minutes.
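In case it helps anyone who finds this thread later, a trimmed-down sketch of what such a SQL*Loader control file can look like; the file, table, and column names here are made up for illustration, not the actual ones from this job:

Code:

-- Hypothetical control file, load.ctl
LOAD DATA
INFILE 'feed.csv'
APPEND
INTO TABLE target_tbl
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  cust_id,
  -- trim the incoming field with a SQL expression applied during the load
  cust_name  "TRIM(:cust_name)",
  -- treat an all-blank field as NULL
  amount     NULLIF amount=BLANKS
)

It would then be invoked with something like sqlldr userid=... control=load.ctl; the direct=true option is worth testing too, since direct-path loads are usually much faster than conventional ones.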

Thanks!!