Perf Issue when reading .csv file using sequential Stage
Hi All
I have a performance issue while reading a .csv file that contains 10,254,768 rows of data.
The job flow is as below:
Seq File > Transformer > Sort > Transformer > Oracle Stage (2 Oracle stages - one for capturing reject data and the other for good data)
Our process runs on two nodes only.
As I was not able to pinpoint the issue, I split the job into two, assuming the performance hindrance is in reading the file.
It was taking 5 minutes to read 10% of the file, which makes the job run for almost 2 hours depending on the server load. The only hiccup is reading this huge volume of data; the data itself is upserted into the DB within 15 minutes, which I guess is acceptable for now.
Is there any way I can improve the performance?
We only started receiving files of this size five days ago, and I am not sure if we will get the same amount in future. I will be following up with the source on why the volume has suddenly spiked by 50%.
Any help on this issue is appreciated.
PS: I tried searching the forum, but quit after scanning 50-60 pages.
Sorry if I have overlooked anything.
How did you split the job? What other steps are present in your reading job?
Try another test: read and write to a dataset without any sorts or transformers and see how much time that job takes.
What is the byte length of the record?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
As DSGuru said...
You need to find out which stage is actually causing the bottleneck; do that by gradually dropping stages and sending the output to a Copy stage. I wouldn't use Upsert either - we have found it to be incredibly slow for large amounts of data. Use a Change Capture to determine updates from inserts and run them separately.
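A rough sketch of what "run them separately" could look like on the database side, assuming a hypothetical staging table stg_orders and target table tgt_orders; in the job itself the split would come from the Change Capture stage rather than from SQL:

Code:

-- New keys become plain inserts
INSERT INTO tgt_orders (order_id, customer_name, amount)
SELECT s.order_id, s.customer_name, s.amount
FROM   stg_orders s
WHERE  NOT EXISTS (SELECT 1 FROM tgt_orders t WHERE t.order_id = s.order_id);

-- Existing keys become plain updates
UPDATE tgt_orders t
SET    (customer_name, amount) =
       (SELECT s.customer_name, s.amount
        FROM   stg_orders s
        WHERE  s.order_id = t.order_id)
WHERE  EXISTS (SELECT 1 FROM stg_orders s WHERE s.order_id = t.order_id);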
I agree with that, chulett. The tables are in good shape: perfectly indexed, manually extended with more storage space, and no other processes are running on the server as of now.
However, there is one interesting thing I have noticed: in the before-job subroutine, a script is called to truncate the table.
I manually truncated the table, removed the script call, and ran the job; it completed within 45 minutes.
I am not sure if this is really affecting the performance or if it is just me going crazy.
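For what it's worth, it may be worth checking what that script actually executes. A minimal sketch of the difference (the table name tgt_orders is just a placeholder):

Code:

-- TRUNCATE is DDL: it resets the high-water mark and is nearly instantaneous
TRUNCATE TABLE tgt_orders;

-- DELETE removes rows one by one and generates undo/redo;
-- on a 10M-row table this alone can take a long time
DELETE FROM tgt_orders;
COMMIT;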
Create this job and identify the read time.
Investigate using two readers per node.
Code:
SequentialFile -----> Copy
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
algfr wrote: I think it is more efficient to import the contents of the seq file using a single column and then dispatching with an import column stage.
That is what happens under the covers whilst reading from a sequential file - but in a combined mode.
Btw, why are two Transformers present? Can't you use a single Transformer and do both?
Remove the Oracle stage and use a Copy stage to see the performance.
It will be a challenge to read such a HUGE set of data in a single column.
The max size I have seen is
I have done more research on the job and observed that the Sort stage was taking an exceptionally long time. I applied Hash partitioning on the sort keys, and the job performance has improved drastically; it now takes 20-30 minutes.
I have not changed any of the data flow, so that makes me much more confident :D
I am debugging the job further and will keep the thread posted with the outcome.
Hi All
I tried many ways to optimize the performance in DataStage for loading the 10M rows we are getting now, but was not able to achieve any improvement beyond a point. So I have switched the whole process to SQL*Loader. I was able to apply all the trims and NULL conditions without any problems, and the data load now completes within 30 minutes.
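For reference, a minimal SQL*Loader control-file sketch showing how trims and NULL defaults can be applied during the load; the file name, table name and columns are just placeholders:

Code:

LOAD DATA
INFILE 'orders.csv'
TRUNCATE
INTO TABLE tgt_orders
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  order_id,
  customer_name  "TRIM(:customer_name)",
  amount         "NVL(:amount, 0)"
)

Running sqlldr with DIRECT=TRUE (direct path load) is usually what gives the big speed-up over conventional-path inserts.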
Thanks!!