Join Stage slowness in parallel job

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

ayadav77
Participant
Posts: 26
Joined: Wed Jun 18, 2014 11:51 am
Location: Pune

Join Stage slowness in parallel job

Post by ayadav77 »

I am trying to migrate data from SQL Server to a Postgres DB using a DataStage parallel job, but the Join stage is very slow, i.e. it is joining records at about 100 rows/sec.

Code: Select all

                            [DB2 reference data]
                               DB2 Connector
                                     |
                                   rlink
                                     |
[SQL Server]--------plink--------[Join Stg]--------[Postgres DB]
ODBC Connector                                     ODBC Connector

Source data: 100 million
Reference data: 3 million
Join: Left Outer Join

I have sorted the data in the database query on the key (varchar) column and used hash partitioning on both links (plink and rlink).
The job runs on an 8-node configuration file. I have set Array size and Row count to 200000 in both the source and target connectors.
After all of this configuration, the DataStage job still takes more than 48 hours.

Can someone help me optimize this job? Please let me know any possible solutions other than a bulk load.

Note: the target DB is on the AWS cloud, and the target table has a primary key index only.
Regards,
Anil
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Re: Join Stage slowness in parallel job

Post by PaulVL »

Don't go boosting that array size without taking into account what it is doing to your memory buffers.

Here's my rule of thumb. Each stage has (by default) about 3MB of buffer space. I always recommend calculating your row length, taking a third of that buffer space, and seeing how many rows fit into it: 1,000,000 / row length = X. Round X down to the nearest 100 and use that number for your array size. That way DataStage will not write temporary buffer files to your TMPDIR location; every read and write to disk slows the process down.
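To make the arithmetic concrete, here's a minimal sketch in Python (not DataStage syntax; the 3MB figure and the round-down-to-100 rule are just my rule of thumb from above):

Code: Select all

BUFFER_BYTES = 3_000_000           # approximate default per-stage buffer (~3MB)
USABLE_BYTES = BUFFER_BYTES // 3   # use about a third of the buffer

def suggested_array_size(row_length_bytes):
    """Rows that fit in a third of the buffer, rounded down to the nearest 100."""
    rows = USABLE_BYTES // row_length_bytes
    return (rows // 100) * 100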

Next is the very misleading rows-per-second figure you may see when monitoring the job. It misleads you, plain and simple. I always look at the live row-count updates instead. Set your refresh to 10 seconds, write down the value you see on the stage of interest, wait for the refresh, then write down the new value. New Value - Old Value = number of rows in those 10 seconds; divide by 10 and that is your CURRENT rows per second for that stage.
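As a quick sketch of that arithmetic (Python, purely for illustration; the readings are hypothetical):

Code: Select all

def current_rows_per_sec(old_count, new_count, interval_secs=10):
    # Two monitor readings taken one refresh interval apart
    return (new_count - old_count) / interval_secs

print(current_rows_per_sec(1_250_000, 1_251_000))  # -> 100.0 rows/sec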

If you just look at the rows-per-second value on the right-hand side, it does not reflect the current processing speed of that stage, because the calculation is based on the START TIME of the job, not the time at which that stage got its first row.
ayadav77
Participant
Posts: 26
Joined: Wed Jun 18, 2014 11:51 am
Location: Pune

Re: Join Stage slowness in parallel job

Post by ayadav77 »

Thanks for the reply!

I tried setting the Array size to 70000, but without success.
Row length = 200
Total number of rows = 14,000,000

Do I need to set APT_DEFAULT_TRANSPORT_BLOCK_SIZE as well?
Regards,
Anil
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Re: Join Stage slowness in parallel job

Post by PaulVL »

1,000,000 / 200 = 5,000.

Try that as your array size.
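Plugging your numbers into the sketch from my earlier reply:

Code: Select all

print(suggested_array_size(200))  # 1_000_000 // 200 = 5000, already a multiple of 100
# -> 5000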
ayadav77
Participant
Posts: 26
Joined: Wed Jun 18, 2014 11:51 am
Location: Pune

Re: Join Stage slowness in parallel job

Post by ayadav77 »

Just for your information: I took the row length from the table schema, i.e. the sum of the lengths defined in the table for each column.
Is this correct?
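For example (the column names and lengths below are made up; only the method of summing matters):

Code: Select all

# Hypothetical defined lengths from the table schema, in bytes
column_lengths = {"cust_key": 20, "name": 60, "address": 100, "status": 20}
row_length = sum(column_lengths.values())
print(row_length)  # -> 200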
Regards,
Anil
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Re: Join Stage slowness in parallel job

Post by PaulVL »

yes.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Re: Join Stage slowness in parallel job

Post by PaulVL »

Any speed improvement?