Trying to understand job score (1)

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Trying to understand job score (1)

Post by abc123 »

I have gone through all the posts, the documentation, and the extra documentation from IBM. Here is the job design:

DataSet -> Sort -> Copy -> SequentialFile

Here are the first nine lines of the score:

Code:

1)main_program: This step has 4 datasets:
2)ds0: {/opt/IBM/InformationServer/Server/Datasets/Three.ds
3)      eAny=>eCollectAny
4)      op0[4p] (parallel Source)}
5)ds1: {op0[4p] (parallel Source)
6)      eOther(APT_HashPartitioner { key={ value=Col1, subArgs={ cs }
7) }
8)})#>eCollectAny
9)      op1[4p] (parallel APT_CombinedOperatorController:Sort_21.One_Sort)}
Questions:

1) Line 1: I know the 4 datasets it is talking about are virtual datasets. How does the framework come up with 4?
2) Line 3: What does 'eAny=>eCollectAny' mean? What are eAny and eCollectAny?
3) Line 6: I am assuming that eOther means a non-default partitioning method is being used. Is this right?
4) Line 6: What does APT_HashPartitioner mean in this situation?
5) Line 8: What does '#>eCollectAny' mean?

I would appreciate any response.
LS
Participant
Posts: 5
Joined: Tue Mar 15, 2011 3:48 pm
Location: Europe

Re: Trying to understand job score (1)

Post by LS »

Hi abc123.

1: You can see which datasets are virtual by looking at what ds0, ds1, ds2 and ds3 are. The job score itself explains exactly why there are 4 datasets.
2: eAny means round robin partitioner; eCollectAny means round robin collector; "=>" means parallel to parallel - same.
3: In this case yes.
4: That it is doing a hash partitioning.
5: "#>" means parallel to parallel - not same.
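A quick way to keep the markers straight is the little cheat sheet below (a purely illustrative Python sketch, not anything DataStage itself uses; the ">>" reading is inferred from the buffer-to-export link that appears later in this thread):

Code:

# Illustrative cheat sheet of link markers seen in a job score dump.
# "=>" and "#>" are as explained above; ">>" is the parallel-to-sequential
# case (collection), which shows up on the ds3 link later in this thread.
LINK_MARKERS = {
    "=>": "parallel producer -> parallel consumer, partitioning preserved (same)",
    "#>": "parallel producer -> parallel consumer, repartitioned (not same)",
    ">>": "parallel producer -> sequential consumer (collection)",
}

def describe(marker):
    """Return a human-readable description of a score link marker."""
    return LINK_MARKERS.get(marker, "unknown marker")

for m in ("=>", "#>", ">>"):
    print(m, "-", describe(m))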

Have fun,
me.

[snip]
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Data sets are virtual or physical. In the example, ds0 is a physical data set. There is a virtual data set associated with each link in the job that is actually executed.

eAny is actually "(Auto)", not Round Robin. Likewise eCollectAny is "(Auto)".

When eOther is used, the actual algorithm is specified in parentheses following it. In this case it was Hash on Col1, case sensitive. APT_HashPartitioner is the partitioner that uses Hash as its algorithm.
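Conceptually, a hash partitioner assigns each row to a partition based on a hash of the key value. A rough Python sketch of the idea (not engine code; the partition count of 4 simply mirrors the 4-node run above):

Code:

# Conceptual sketch only: how a hash partitioner decides where a row goes.
# 4 partitions mirrors the [4p] operators in the score; "Col1" is the key,
# compared case-sensitively (the "cs" subArg).
NUM_PARTITIONS = 4

def hash_partition(row):
    """Return the partition number for a row, hashing Col1 case-sensitively."""
    key = row["Col1"]                  # no case folding: 'cs' means case sensitive
    return hash(key) % NUM_PARTITIONS

for row in ({"Col1": "apple"}, {"Col1": "Apple"}, {"Col1": "banana"}):
    print(row["Col1"], "-> partition", hash_partition(row))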
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

LS and Ray, thank you for your responses.

Here is the complete score:
--------------------------------------------------------------------------------
1)main_program: This step has 4 datasets:
2)ds0: {/opt/IBM/InformationServer/Server/Datasets/Three.ds
3) eAny=>eCollectAny
4) op0[4p] (parallel Source)}
5)ds1: {op0[4p] (parallel Source)
6) eOther(APT_HashPartitioner { key={ value=Col1, subArgs={ cs }
7) }
8)})#>eCollectAny
9) op1[4p] (parallel APT_CombinedOperatorController:Sort_21.One_Sort)}

10) ds2: {op1[4p] (parallel APT_CombinedOperatorController:Copy_23)
11) eSame=>eCollectAny
12) op2[4p] (parallel buffer(0))}
13) ds3: {op2[4p] (parallel buffer(0))
14) >>eCollectOther(APT_SortedMergeCollector { key={ value=Col1,
15) subArgs={ asc, cs }
16) }
17) })
18) op3[1p] (sequential APT_RealFileExportOperator in Sequential_File_28)}
19) It has 4 operators:
20) op0[4p] {(parallel Source)
21) on nodes (
22) node1[op0,p0]
23) node2[op0,p1]
24) node3[op0,p2]
25) node4[op0,p3]
26) )}
27) op1[4p] {(parallel APT_CombinedOperatorController:
28) (Sort_21.One_Sort)
29) (Sort_21)
30) (Copy_23)
31) ) on nodes (
32) node1[op1,p0]
33) node2[op1,p1]
34) node3[op1,p2]
35) node4[op1,p3]
36) )}
37) op2[4p] {(parallel buffer(0))
38) on nodes (
39) node1[op2,p0]
40) node2[op2,p1]
41) node3[op2,p2]
42) node4[op2,p3]
43) )}
44) op3[1p] {(sequential APT_RealFileExportOperator in Sequential_File_28)
45) on nodes (
46) node2[op3,p0]
47) )}
48) It runs 13 processes on 4 nodes.
-----------------------------------------------------------------------------

A few more questions:

1) So eAny is Auto Partitioning and eCollectAny is Auto Collecting, right?

2) So ds0 is the physical dataset and for each link (there are 3 in the job) the datasets are ds1, ds2 and ds3, right?

3) If the Sort stage is changed to sequential execution mode, the number of datasets increases to 5. Do you know why?

4) Line 11: Is it saying to maintain the partitioning? Why is it doing that for the Copy stage in line 10? Shouldn't it be doing it for the Sort, since the Copy stage is combined out in the score?

5) Line 48:
Where do we get 13 processes from?
The DataSet stage has 1 operator that spawns 4 processes. The Sort combines with the Copy and has 2 operators (the Copy does not have an operator, but the Sort has 2 - psort and tsort, right?) and spawns 8 processes (it seems!). The final Sequential File stage has the Import operator and spawns 1 process.
Can you please clarify/correct this statement?
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

To answer your questions:

1) Yes

2) In this job score, that is correct. The numbers are irrelevant; they are just the order in which the datasets happened to be identified.

3) Because a sequentially-operating operator is not combinable with a parallel-operating operator. Now there is an additional link in action and therefore an additional dataset. You should be able to see this in the score for that job.

4) Do you have the Force property set in the Copy stage options? Sort_21 and Copy_23 are running as a combined operator according to your score, so the Copy has not been removed at runtime.

5) Sort runs 1 operator; which one is used (tsort or psort) depends upon which type of sort you selected in the stage options.
The score shows you in lines 19-48 which operators (processes) are running on which node(s). op0 (DataSet stage) runs on 4 nodes, op1 (Combined Sort/Copy) runs on 4 nodes, op2 (inserted Buffer operator) runs on 4 nodes and op3 (Sequential File stage) runs on 1 node. 4+4+4+1 = 13
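To spell that arithmetic out, here is a tiny illustrative Python snippet using the degrees of parallelism from the score:

Code:

# Player process count = sum of each operator's degree of parallelism,
# taken straight from the score (op0..op3).
operators = {
    "op0 (DataSet source)":          4,  # [4p]
    "op1 (combined Sort/Copy)":      4,  # [4p]
    "op2 (inserted buffer)":         4,  # [4p]
    "op3 (Sequential File export)":  1,  # [1p]
}

print("player processes:", sum(operators.values()))  # 4 + 4 + 4 + 1 = 13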

You should try disabling operator combination (APT_DISABLE_COMBINATION=1), examining the score from that run, and then comparing it to this one to see how it changes. Also, OSH_DUMP would be useful to see the final version of the OSH that actually gets executed.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

JWiles, thank you for your response. My apologies for the long post. Here is my understanding. Please correct any statements and answer any questions if possible. We can use this post to help others understand the score.

1)main_program: This step has 4 datasets:
2)ds0: {/opt/IBM/InformationServer/Server/Datasets/Three.ds
MEANS: This is the first and only physical dataset.

3) eAny=>eCollectAny
MEANS: In the source DataSet stage, it is Auto Partitioning and in the Sort stage it is Auto Collecting.

4) op0[4p] (parallel Source)}
MEANS: The first operator is the target operator for the first dataset (there is no source operator for it).

5)ds1: {op0[4p] (parallel Source)
6) eOther(APT_HashPartitioner { key={ value=Col1, subArgs={ cs }
7) }
8)})#>eCollectAny
9) op1[4p] (parallel APT_CombinedOperatorController:Sort_21.One_Sort)}
LINES 5-9 MEAN:
a) The Dataset operator is the source operator for this 2nd dataset.
b) A partitioning method other than Auto is being used, on Col1, case sensitive.
c) Rows from this partitioning are being collected using Auto collection.
QUESTION: (b) above is happening in the Sort stage. How can the collection happen in the same stage?


10) ds2: {op1[4p] (parallel APT_CombinedOperatorController:Copy_23)
MEANS: Dataset 2: this is on the link between the Sort and the Copy.

11) eSame=>eCollectAny
MEANS: The partitioning method is continuing.
QUESTION: Exactly at what point is this happening? Within a stage? At the input of a stage? Which stage?


12) op2[4p] (parallel buffer(0))}
MEANS: DataStage inserts a buffer operator to prevent deadlock.

13) ds3: {op2[4p] (parallel buffer(0))
MEANS: Dataset 3: this is on the link between the Copy and the SequentialFile stage.

14) >>eCollectOther(APT_SortedMergeCollector { key={ value=Col1,
15) subArgs={ asc, cs }
16) }
17) })
LINES 14, 15: 2 QUESTIONS:
a) Is this saying that, since the initial partitioning was on Col1, the collecting is happening the same way?

b) Exactly at what point in the flow is this happening? That is, is it at the input of the SequentialFile stage?


18) op3[1p] (sequential APT_RealFileExportOperator in Sequential_File_28)}
MEANS: This is the last operator; an export is happening through it.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

You're close!

1/2) Yes, in THIS job, this is the only physical dataset. That does NOT mean it's automatically assigned the id ds0.

3) I think it is more accurate in this case to say that the input dataset (the physical one) is not being repartitioned.

4) This would be the DataSet stage--the name here will not always match what you see in the Designer palette.

So to break it down by dataset:

1-4) ds0: describes the first dataset and what it feeds into (the Dataset stage or "Source" - actually the copy operator IIRC). No repartitioning.

5-9) ds1: More or less correct. The output of the Dataset operator is the source for ds1; it is being Hash partitioned and fed into the Sort operator with auto collection.

10-12) ds2 is the output of the Copy and feeds into the inserted buffer operator. Because the Sort and Copy are combined into op1, there is no true dataset linking them. The virtual datasets are between processes, not operators within the same process (combined operator or composite operator).

13-18) ds3 links the output of the parallel buffer to the input of the Sequential File stage (actually the export operator). The SortedMerge collector is used--I don't know if you selected that in the Stage or if that's what DS chose. If DS chose it, the engine may have inferred that the job's intention was to have a sorted sequential file; another collection method wouldn't have guaranteed that result.

A basic dataset description as shown in the score:

ds#: {source of dataset, Partition>>Collection, target of dataset}

I don't recall exactly where partitioning/collection physically happens (either at the input or output of a dataset). I'm going to SWAG that partitioning happens at the source end of a dataset/link and collection happens at the target end. It generally isn't a separate process but is attached to another, although I won't promise that is always the case :)
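As a concrete illustration of that shape, the ds3 entry above breaks down roughly like this (Python, purely for illustration; the slot names are mine, not DataStage's):

Code:

# The ds3 entry from the score, broken into the slots of the template above.
# Slot names are invented for readability; they are not DataStage terms.
ds3 = {
    "source":     "op2[4p] (parallel buffer(0))",
    "marker":     ">>",  # parallel producer feeding a sequential consumer
    "collection": "APT_SortedMergeCollector on Col1 (asc, cs)",
    "target":     "op3[1p] (sequential APT_RealFileExportOperator)",
}

for slot, value in ds3.items():
    print(f"{slot:10}: {value}")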

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's useful to remember that collection will only occur if the upstream operator (the "producer") is executing in parallel mode and the downstream operator (the "consumer") is executing in sequential mode. Partitioning occurs when the consumer is executing in parallel and the algorithm is other than Same (or Same generated by Auto).
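That rule can be paraphrased as a small decision function; a simplified Python sketch (not how the engine actually decides, just a restatement of the above):

Code:

# Simplified restatement of the rule above: what happens on a link?
def link_action(producer_parallel, consumer_parallel, algorithm):
    """algorithm is the effective partitioning method on the consumer's input."""
    if producer_parallel and not consumer_parallel:
        return "collection"
    if consumer_parallel and algorithm not in ("Same", "Auto(Same)"):
        return "partitioning"
    return "none (Same / straight through)"

print(link_action(True, True, "Hash"))   # ds1: source -> Sort, repartitioned
print(link_action(True, True, "Same"))   # ds2: combined Sort/Copy -> buffer
print(link_action(True, False, "Auto"))  # ds3: buffer -> sequential export, collected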
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

JWiles and Ray, thank you very much.

Ray, thank you for making the partitioning/collecting piece clear.
JWiles, I did NOT choose a collection method on the Sequential File stage. DS chose the SortedMerge collector.

Side Question:
What is the Orchestrate operator name for the DataSet stage? There isn't one listed in the 'Stage To Operator Mapping' section in the Advanced Developer's Guide.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

The DataSet stage is generally implemented using the Copy operator. The Parallel Job Advanced Developer's Guide discusses it a little, but you can also examine the generated OSH (Job Properties button) after a compile, or the OSH shown in the job log.

BTW, the Sequential File and FileSet stages are implemented using the import and export operators, depending on whether they are input or output stages.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
pradkumar
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Post by pradkumar »

Hi all,

I hope someone can give some advice; thanks in advance.

1) How many nodes are defined in the config file? 4 or 5? Is there a conductor node?
2) For SMP, do we need a conductor node?
3) Are there any processes generated by the conductor node other than those 13 processes?

Regards.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

1) No idea. You have not provided the configuration file nor any information about the use of node pools. But at least four.

2) Yes, and no. You must have at least one node in the default node pool, and the first-named of these (by default) will be where the conductor process executes.

3) Yes. The score shows only player processes. It does not show the conductor and it does not show section leader processes. There is one conductor process for the job. There is one section leader process per node.
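So for the score discussed above (13 players on a 4-node configuration), the total works out as below; a quick illustrative calculation:

Code:

# Total processes for this job run, per the explanation above:
# the score lists only player processes; add one section leader per node
# and one conductor for the job.
nodes = 4
players = 13             # "It runs 13 processes on 4 nodes."
section_leaders = nodes  # one per node
conductor = 1            # one per job

print("total processes:", players + section_leaders + conductor)  # 13 + 4 + 1 = 18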
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.