DSXchange: DataStage and IBM Websphere Data Integration Forum
Developer9
Posted: Tue Jun 19, 2018 10:44 am

DataStage® Release: 11x
Job Type: Parallel
OS: Windows
Hi,
I am debugging one of our jobs running on a single node (DS 11.5 / Windows server) that takes about 5 minutes of production run time. My goal is to identify bottlenecks and tune the job so that it runs in less time.

The job design reads Oracle source data (Connector stages, record ranges from 0.1M to 4M) and uses datasets (record ranges from 500 to 100,000) as reference data for Join and Lookup operations, plus several other stages: Sort, Copy, Aggregator, Funnel, and Transformer. The target writes to 2 datasets. Overall there are 50+ stages.



Transformer LOGIC

Code:
** timestamp to varchar conversion **
If IsNull(DATE_CREATED) Then '' Else If IsValidTimestamp(DATE_CREATED) Then DateToString(TimestampToDate(DATE_CREATED), "%mm/%dd/%yyyy") Else ''

** decimal conversion **
DecimalToString(val, "fix_zero,suppress_zero")


Director job logs


10:16:34 AM Starting Job ..
$APT_CONFIG_FILE = C:\IBM\InformationServer\Server\Configurations\1Node.apt
$APT_DISABLE_COMBINATION = True
$APT_SORT_INSERTION_CHECK_ONLY = False
$APT_STRING_PADCHAR = 0x0 (From value file)
$APT_PM_PLAYER_TIMING = False (From value file)
$APT_PM_PLAYER_MEMORY = True (From value file)
$APT_RECORD_COUNTS = True (From value file)
$OSH_PRINT_SCHEMAS = True (From value file)
$APT_DUMP_SCORE = True (From value file)
$APT_STARTUP_STATUS = False (User supplied)

10:16:34 AM Attached Message Handlers:
10:16:37 AM Project message handler: SortStageResort
10:16:37 AM Environment variable settings:
...
Parallel job initiated
OSH script ..
# OSH / orchestrate script for Job
main_program: orchgeneral: loaded
10:16:39 AM orchsort: loaded
10:16:39 AM orchstats: loaded
**oracle connector extract ** (elapsed time =14 sec)
10:16:53 AM ext_CONTACTS_plus: The connector connected to Oracle server oradb1.

**dataset folder**
10:17:54 AM is/isdata/tgt1_ds,0: Heap growth during runLocally(): 86024192 bytes

***Transformer stage Messages (elapsed time =9 sec)
10:18:47 AM xfm_standardize,0 :Input 0 consumed 520749 records.
10:18:47 AM xfm_standardize,0: Output 0 produced 520749 records.
10:18:56 AM xfm_standardize,0: Heap growth during runLocally(): 85639168 bytes

***LOOKUP stage (elapsed time =4 sec)
10:19:06 AM lkp_ob_co_bond_obi,0: Heap growth during runLocally(): 133578752 bytes
10:19:10 AM buffer(1),0: Heap growth during runLocally(): 85590016 bytes
10:19:10 AM buffer(0),0: Heap growth during runLocally(): 85708800 bytes

***funnel stage
10:19:10 AM funl_inv_fiv,0: Heap growth during runLocally(): 85143552 bytes

***sort
10:19:11 AM srt_bond,0: Heap growth during runLocally(): 85983232 bytes
10:19:12 AM buffer(10),0: Heap growth during runLocally(): 83468288 bytes
**oracle connector stage (elapsed time =22 sec)
10:19:12 AM ext_plus,0: Number of rows fetched on the current node: 1151731.
10:19:12 AM ext_plus,0: Output 0 produced 1151731 records.
10:19:34 AM ext_plus,0: Heap growth during runLocally(): 82477056 bytes
***copy stage ** (elapsed time = 1 minute, 10 sec)

10:19:38 AM cpy_stage,0: Output 0 produced 1151731 records.
10:19:38 AM cpy_stage,0: Input 0 consumed 1151731 records.
10:20:48 AM cpy_stage,0: Input 0 consumed 1151731 records.

***join stage ** (Left outer join -- input links Auto partition) (elapsed time = 1 minute, 4 sec)

10:21:33 AM Jnr_Bond_id,0: Input 0 consumed 1151731 records.
10:21:33 AM Jnr_Bond_id,0: Output 0 produced 1151731 records.
10:22:45 AM Jnr_Bond_id,0: Heap growth during runLocally(): 80654336 bytes
10:22:47 AM buffer(23),0: Heap growth during runLocally(): 80687104 bytes

***dataset ** (partition/collection -- Auto) (elapsed time = 6 sec)

10:22:49 AM ds_tgt1,0: Input 0 consumed 1207442 records.
10:22:49 AM ds_tgt1,0: Input 0 produced 1207442 records.
10:22:55 AM ds_tgt1,0: Heap growth during runLocally(): 79831040 bytes

10:22:55 AM main_program: Step execution finished with status = OK.
10:22:58 AM main_program: Startup time, 1:18; production run time, 4:57.
10:22:58 AM Parallel job reports successful completion
10:22:59 AM Finished Job .

Redesign: split the original job into 2 jobs.

job1: extract the Oracle data and write it to 2 datasets

job2: use the 2 datasets as sources for the original job. It doesn't help either; rather, it becomes worse:

main_program: Startup time, 0:37; production run time, 13:42.

Please give me some insights on performance tuning this job.

Thanks.
chulett
Posted: Tue Jun 19, 2018 2:11 pm

So... let me see... you have a Parallel job with 50+ stages that processes 1 to 4 million records on a Windows server in around 5 minutes and you think the performance needs to be improved? Okay. Might be a bit difficult to whittle that down. Have you tried simply increasing the number of nodes, say to 2 as an experiment? Obviously you'd need to check your partitioning strategy if you haven't already.

I'd also be curious why the performance concern. Is this something that will be running many times over the course of the day, for example?
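For reference, a two-node config file is just the one-node file with a second node block added. The below is only a sketch; the fastname and resource paths are placeholders, so substitute whatever your actual install uses:

Code:
{
    node "node1"
    {
        fastname "your_server"
        pools ""
        resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "your_server"
        pools ""
        resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}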

_________________
-craig

The Old Ones were, the Old Ones are, and the Old Ones shall be. Not in the spaces we know, but between them. They walk serene and primal, undimensioned and to us unseen.
Developer9
Posted: Wed Jun 20, 2018 4:08 pm

@Chulett,

For the first question, regarding increasing the number of nodes to 2, I attempted a test run:

Code:
main_program: Startup time, 1:38; production run time, 8:42.

And regarding partitioning: it is a hash partition on the Sort stages after the Oracle Connector extract; then, further down the flow, I try to maintain it with "Same" partitioning everywhere, including the inputs to the Join/Lookup stages. What does heap growth generally indicate?

Currently these jobs run (extract and generate delta files using Change Data Capture) at a 2-hour interval, but the business need is to run at a 1-hour interval if possible; that is why I am tuning the jobs, from a timing and resource utilization standpoint.

Thanks.
chulett
Posted: Wed Jun 20, 2018 9:46 pm

Sorry, still a bit stuck on this... it needs to run every hour and you're worried about a five-minute runtime? What was the worst runtime you've seen so far? I'm assuming it is being tracked.

I was worried you actually needed to run it every five minutes. I got tasked with that back in the day: take a nightly task that churned out a ton of XML in a couple of hours and cut it back so it could process "micro-batches" every five minutes. Managed to do it, though. I added a safety net that the looping Sequence job controlling it could leverage: track the previous runtime, and if we floated past 5 minutes (gawd forbid it ran 6) delay the loop until the next 5-minute clock time. It worked surprisingly well. Of course, these were all Server jobs without all of the crazy ramp-up and cool-down times that PX can be saddled with, but hey - made it work.

It's been far too long since I had any hands-on DataStage experience, so I'll let others give any specific partitioning advice they are willing to share. I'm unclear what the "heap growth" question is about or how to answer it, so I will leave that to others as well.

qt_ky
Posted: Thu Jun 21, 2018 5:09 am

If I had a 60-minute window and the job was taking only 5 to 10 minutes, then I might take a quick glance at tuning and quickly move on.

What jumps out at me is that the job startup times look pretty bad. What in the world is going on there? Have you got some slow before-SQL statements in your database stages? Our startup times are consistently 2 to 4 seconds, even on jobs with 30+ stages. There have been known issues with long startup times, so it is possible you are on a version with such an issue and may need a patch. Open a support case to get help debugging it. Please let us know what you find out.

Another thing to question is the overall job design: whatever it is doing, is it reprocessing the same records and pulling in full volumes of data with each and every run, or could it be redesigned to process only the records that are new, updated, or deleted at each step?

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
Developer9
Posted: Thu Mar 28, 2019 5:50 am

@qt_ky and chulett, thank you for the responses.

I am on version 11.5 and have opened a support case with IBM for their recommendations.

First, I am working on the job changes to use a 2-node configuration. One change I made is in the Oracle Connector properties: Enable partitioned reads = Yes and partition method = Rowid range (the default). But I can't find this pseudo-column ROWID in my Oracle tables. Is it something the DBA creates for me? Any suggestions, please:
Code:
SELECT Table1.col1, Table2.col2
FROM Table1
   INNER JOIN Table2
      ON Table1.ID = Table2.ID
WHERE (Table1.ROWID BETWEEN '???' AND '???')


Apart from the connector settings for parallel read/write, the other thing to remember is to choose an appropriate partition method for the Sort, Join, and Lookup stages so that the job utilizes the 2-node configuration. The current setting I have is "Same" partitioning, but I would have to change it to "Hash".

Another option: I am also looking at the overall design to include date filters in the SQL extracts, so that each run only pulls the records modified since yesterday (or since the last run). I think this approach gives a considerable reduction in the daily data pull, which also benefits the downstream incremental load job (which uses a Change Capture stage to capture inserts and updates compared to the prior run).
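As a sketch of what I have in mind (the LAST_UPDATE_DT column and the #LAST_RUN_TS# job parameter here are hypothetical; the real audit column and parameter names would come from our tables and job):

Code:
-- pull only rows changed since the previous run
SELECT t1.col1, t2.col2
FROM Table1 t1
   INNER JOIN Table2 t2
      ON t1.ID = t2.ID
WHERE t1.LAST_UPDATE_DT >= TO_TIMESTAMP('#LAST_RUN_TS#', 'YYYY-MM-DD HH24:MI:SS')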

I will keep the forum updated with my findings.

Thank you.
chulett
Posted: Thu Mar 28, 2019 6:19 am

https://docs.oracle.com/cd/B19306_01/server.102/b14200/pseudocolumns008.htm

It is a column that you generally don't need to worry about, but if you want to see or use it, you need to include it explicitly.
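For example (the table and column names here are just illustrative):

Code:
-- ROWID is a pseudocolumn available on every table; no DBA setup needed
SELECT t.ROWID, t.col1
FROM table1 t;

And as I understand it, with the Rowid range read method the connector generates that BETWEEN predicate for you at runtime, so you shouldn't need to hand-code the '???' values yourself.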

ray.wurlod
Posted: Thu Mar 28, 2019 10:48 am

Have you eliminated the insertion of tsort operators on both of the inputs to Join stages?

Why are you running on only one node?

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
Developer9
Posted: Fri Mar 29, 2019 8:41 am

For the first question: I am using the environment variable $APT_SORT_INSERTION_CHECK_ONLY = False to verify the sorted data. Does this serve the purpose of eliminating the tsort operators?
For the second: it was a suggestion by a DS admin a couple of years ago to use only a single node, and I haven't revised the configuration since. Does increasing the number of nodes affect the job output? I need to test this job with a 2-node configuration in a lower environment before making the change in production.


Job design and partition

Code:
oracle connectors (inputs) -- sort (hash) -- join (same) -- lookup (same on stream, auto on reference) -- aggregator (same) -- funnel (auto) -- dataset (auto)


I am considering the following factors to debug the performance issue:

1. Overall design (bad design may be a cause): select proper partitioning.
2. Remove unnecessary Sort stages from the job (currently 15-20 of them).
3. Split the single job (more than 50 stages) into 2 or more jobs. Probably do all the extracts in a single job, write them to datasets, then use those in the other jobs? Does that help?
4. Review the job log for performance-related data, setting these parameters: $APT_PM_PLAYER_TIMING, $APT_PM_PLAYER_MEMORY, $OSH_PRINT_SCHEMAS, $APT_DUMP_SCORE = True.
5. main_program: Startup time, 0:49; production run time, 13:49 ... review events from the Director log to identify the stage/process at which the job lagged.

I have another job using a Change Capture stage (using datasets to compare and write inserts and updates to a file), and several other jobs with the same issue, each running for more than 5-10 minutes on average; together, all these jobs take about an hour to complete the master sequence. So tuning is essential to reduce the timing.
Thank you for the input and suggestions.
ray.wurlod
Posted: Sat Mar 30, 2019 10:02 pm

First answer: Yes. Second answer: Running on more than one node will increase job throughput up to linearly, assuming that the partitioning is correct. In my experience those who advise single-n ...

Developer9
Posted: Wed Apr 03, 2019 6:12 am

@ray

I can't see the premium content at the moment. I renewed a couple of days ago but it still hasn't been activated.

I did a little testing, running the same job with different node counts. So does having more nodes always improve performance, i.e. make the job run more efficiently?
Code:

main_program: Startup time, 5:16; production run time, 8:43 (6 node)
main_program: Startup time, 2:16; production run time, 6:12 (4 node)
main_program: Startup time, 1:28; production run time, 5:49 (2 node)
main_program: Startup time, 1:00; production run time, 5:24 (1 node)


In addition, I added the environment variable $DS_PXDEBUG = 1 to this job:

Code:
Oracle_Connector_plus,0: Operator completed. status: APT_StatusOk  elapsed: 141.51  user: 24.66  sys: 7.07 (total CPU: 31.73)
Transformer_Validate,1: Operator completed. status: APT_StatusOk  elapsed: 346.06  user: 15.54  sys: 2.37 (total CPU: 17.91)
Transformer_validate,3: Operator completed. status: APT_StatusOk  elapsed: 371.91  user: 7.10  sys: 0.55 (total CPU: 7.64)


What do the above CPU messages imply? How do I identify the bottleneck stage or operator from the Director log?

Please share your tips for debugging this long-running job.
Thanks.
kumar_s
Posted: Tue Apr 09, 2019 9:09 pm

With more partitions, your startup time and production run time are increasing. Are you still running with the Auto partitioning you mentioned in your original job design (the first post)? Or using spec ...

_________________
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Developer9
Posted: Wed Apr 24, 2019 3:47 am

Hi Kumar,

I am using hash partitioning in the Sort stages right after reading from the DB stages, then maintaining "Same" partitioning throughout the downstream flow.

One more update regarding CPU: closely monitoring the Operations Console, CPU reaches more than 80% while this job runs, then drops suddenly after the job finishes.

I am looking at the option of redesigning entire jobs that have these kinds of issues.

Thank you