Join Stage slowness in parallel job

ayadav77 · Post by **ayadav77** » Sun Jul 11, 2021 8:28 am

I am trying to migrate data from SQL Server to Postgres DB using DataStage parallel job, but join stage is very i.e. it is joining record as 100 row/sec

Code: Select all

			[db2 refernce data]db2 Connector
					|
					rlink
					|
					|						   
[SQL Server]-------------plink--------[Join Stg]-------------------[Postgres db]
 ODBC connector                           			ODBC connector

Source data: 100 million
Reference data: 3 million
Join: Left Outer Join

I have sorted the data in db query based on the key(varchar) column and used hash partitioning on both the links(plink & rlink).
Job is running on 8 node config file. I have used Array size and Row count to 200000 in both source and target connectors.
After doing all these configuration, datastage job is taking more than 48 hrs.

Can someone help me to optimized this job? Please let me know any possible solution other than bulk load.

Note: Target db is on aws cloud and target table has primary key index only.

PaulVL · Post by **PaulVL** » Mon Jul 12, 2021 7:16 am

Don't go boosting that array size without taking into account what it is doing to your memory buffers.

Here's my rule of thumb. Each stage has (by default) about 3MB of buffer space. I always recommend calculating your row length, then taking a third of that buffer space and see how many rows fit into that space. so 1,000,000 / row length = X. Round X down to the nearest 100 value. Use that number for your array size. That way, you will not cause datastage to write temporary buffer files to your TMPDIR location. Each time you read and write to disk you slow the process down.

Next is the very misleading rows per second you may be seeing when doing a monitor activity. It misleads to you. Plain and simple. I always look at the live data update counts. Set your refresh to 10 seconds. Write down the value you see on your desired stage. Wait for the refresh then write down the new value. New Value - Old Valud = Num of rows in that 10 seconds. Then divide by 10. That is your CURRENT rows per second on that stage.

If you just look at the right hand side rows per second value, it does not reflect your current speed of processing data in that stage because the calculation is based upon START TIME of job, not the start time in which that stage got it's first row.

ayadav77 · Post by **ayadav77** » Wed Jul 14, 2021 6:15 am

Thanks For Reply!!!

I tried making Array size to 70000, but couldn't find any success.
row length=200
total number of rows=14000000

Do I need to set APT_DEFAULT_TRANSPORT_BLOCK_SIZE as well?

PaulVL · Post by **PaulVL** » Wed Jul 14, 2021 6:23 am

1,000,000 / 200 = 5,000.

Try that as your array size.

ayadav77 · Post by **ayadav77** » Wed Jul 14, 2021 10:38 am

Just for your information.
I have taken the length of row from table schema i.e. sum of the length defined in table for each columns.
Is this correct?

PaulVL · Post by **PaulVL** » Wed Jul 14, 2021 1:31 pm

PaulVL · Post by **PaulVL** » Mon Jul 26, 2021 8:09 am

Any speed improvement?

DSXchange

Join Stage slowness in parallel job

Join Stage slowness in parallel job

Re: Join Stage slowness in parallel job

Re: Join Stage slowness in parallel job

Re: Join Stage slowness in parallel job

Re: Join Stage slowness in parallel job

Re: Join Stage slowness in parallel job

Re: Join Stage slowness in parallel job