
Join Stage slowness in parallel job

Posted: Sun Jul 11, 2021 8:28 am
by ayadav77
I am trying to migrate data from SQL Server to a Postgres DB using a DataStage parallel job, but the Join stage is very slow, i.e. it is joining records at about 100 rows/sec.

Code:

                      [DB2 reference data]
                          DB2 Connector
                                |
                              rlink
                                |
[SQL Server] ----plink---- [Join Stage] ----------- [Postgres DB]
ODBC Connector                                      ODBC Connector

Source data: 100 million rows
Reference data: 3 million rows
Join: Left Outer Join

I have sorted the data in the database query on the key (varchar) column and used hash partitioning on both links (plink & rlink).
The job runs on an 8-node configuration file. I have set Array size and Row count to 200000 in both the source and target connectors.
After all this configuration, the DataStage job still takes more than 48 hours.

Can someone help me optimize this job? Please let me know any possible solution other than bulk load.

Note: the target DB is on the AWS cloud and the target table has only a primary key index.

Re: Join Stage slowness in parallel job

Posted: Mon Jul 12, 2021 7:16 am
by PaulVL
Don't go boosting that array size without taking into account what it is doing to your memory buffers.

Here's my rule of thumb. Each stage has (by default) about 3 MB of buffer space. I always recommend calculating your row length, then taking a third of that buffer space and seeing how many rows fit into it: 1,000,000 / row length = X. Round X down to the nearest 100 and use that number for your array size. That way you will not cause DataStage to write temporary buffer files to your TMPDIR location; every read and write to disk slows the process down.
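
As a rough illustration of that arithmetic (the usable buffer figure and the helper name below are just assumptions for the sketch, not anything DataStage exposes):

Code:

# Back-of-the-envelope array size from row length (illustrative assumptions only).
# Assumes ~3 MB of buffer per stage and uses roughly a third of it (~1,000,000 bytes).
def suggest_array_size(row_length_bytes, usable_buffer_bytes=1_000_000):
    rows_that_fit = usable_buffer_bytes // row_length_bytes  # rows that fit in the usable buffer
    return (rows_that_fit // 100) * 100                      # round down to the nearest 100

print(suggest_array_size(200))  # a 200-byte row -> 5000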

Next is the very misleading rows-per-second figure you may see when monitoring activity. It misleads you, plain and simple. I always look at the live data update counts. Set your refresh to 10 seconds. Write down the value you see on your desired stage, wait for the refresh, then write down the new value. New value - old value = number of rows in those 10 seconds. Divide by 10 and that is your CURRENT rows per second on that stage.

If you just look at the right-hand rows-per-second value, it does not reflect the current speed of data processing in that stage, because the calculation is based on the START TIME of the job, not the time at which that stage got its first row.
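
A quick worked example of that subtraction (the counts below are made-up numbers, purely for illustration):

Code:

# Two monitor readings of the same stage's row count, taken one refresh apart (made-up values)
old_value = 1_250_000        # count at the first refresh
new_value = 1_253_400        # count 10 seconds later
refresh_seconds = 10

current_rows_per_sec = (new_value - old_value) / refresh_seconds
print(current_rows_per_sec)  # 340.0 rows/sec right now, not the job-lifetime average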

Re: Join Stage slowness in parallel job

Posted: Wed Jul 14, 2021 6:15 am
by ayadav77
Thanks for the reply!

I tried setting the Array size to 70000, but didn't see any improvement.
Row length = 200
Total number of rows = 14,000,000

Do I need to set APT_DEFAULT_TRANSPORT_BLOCK_SIZE as well?

Re: Join Stage slowness in parallel job

Posted: Wed Jul 14, 2021 6:23 am
by PaulVL
1,000,000 / 200 = 5,000.

Try that as your array size.

Re: Join Stage slowness in parallel job

Posted: Wed Jul 14, 2021 10:38 am
by ayadav77
Just for your information.
I have taken the row length from the table schema, i.e. the sum of the lengths defined in the table for each column.
Is this correct?
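
For example, this is what I mean (the column names and lengths below are just made up to show the idea, not my actual table):

Code:

# Row length taken as the sum of the column lengths defined in the table schema (illustrative)
column_lengths = {"order_id": 10, "customer_key": 30, "order_date": 8, "description": 152}
row_length = sum(column_lengths.values())
print(row_length)  # 200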

Re: Join Stage slowness in parallel job

Posted: Wed Jul 14, 2021 1:31 pm
by PaulVL
yes.

Re: Join Stage slowness in parallel job

Posted: Mon Jul 26, 2021 8:09 am
by PaulVL
Any speed improvement?