DSXchange: DataStage and IBM Websphere Data Integration Forum
View next topic
View previous topic
Add To Favorites
This topic has been marked "Resolved."
Author Message
SURA



Group memberships:
Premium Members

Joined: 14 Jul 2007
Posts: 883

Points: 6066

Post Posted: Wed Aug 29, 2012 8:36 pm Reply with quote    Back to top    

DataStage® Release: 8x
Job Type: Parallel
OS: Windows
Additional info: Its not the first time
Hello All

Code:

                           SQLSrvr1                 SQLSrvr2       
                              |                             |
                              |                             |   
SRC_SQLSrvr-->TFM-->Join1------------------->Join2--------------->Dataset

SRC_SQLSrvr -- Auto Partition
TFM -- Auto
Join1 & SQLSrvr1 -- Hash , BK sort
Join2 & SQLSrvr2 -- SAME , BK sort
Dataset  -- SAME, BK Sort (Which is going to be the INPUT KEY for next job)


I used 4 node Dataset file.

When I ran this job, it went smoothly, whereas in the Dataset (Dataset management) I can see the data only in one node and the rest of the nodes are not having any data?

Dose any one can find anything?

_________________
Thanks
DS User
----------------------------------
evidence driven decisions and performance driven outcomes
SURA



Group memberships:
Premium Members

Joined: 14 Jul 2007
Posts: 883

Points: 6066

Post Posted: Wed Aug 29, 2012 9:11 pm Reply with quote    Back to top    

Code:

##I IIS-DSEE-TFCN-00001 13:05:41(000) <main_program>
IBM WebSphere DataStage Enterprise Edition 8.5.0.6152
Copyright (c) 2001, 2005-2008 IBM Corporation. All rights reserved
 


##I IIS-DSEE-TFCN-00006 13:05:41(001) <main_program> conductor uname: -s=Windows_NT; -r=1; -v=6; -n=SERVER; -m=Pentium
##I IIS-DSEE-TFSC-00001 13:05:41(002) <main_program> APT configuration file: /tmp/aptoa79446c1ded2e
##I IIS-DSEE-TOIX-00059 13:05:43(000) <APT_RealFileExportOperator in APT_FileExportOperator,0> Export complete; 58 records exported successfully, 0 rejected.
                 Name:  F:/ETL/Dataset/FILE_tst.ds
              Version:  ORCHESTRATE V8.5.0 DM Block Format 6.
     Time of Creation:  08/30/2012 12:55:19
 Number of Partitions:  4
   Number of Segments:  1
       Valid Segments:  1
Preserve Partitioning:  true
  Partitioning Method:  APT_HashPartitioner
Segment Creation Time:
            0:  08/30/2012 12:55:19

Partition 0
  node   : node1
  records: 9716
  blocks : 37
  bytes  : 4741408
  files  :
    Segment 0 :
           /E=/IBM/Datasets/FILE_tst.ds.SERVER.0000.0000.0000.1ef8.d2081307.0000.d34caebf  4849664 bytes
  total   : 4849664 bytes
Partition 1
  node   : node2
  records: 412646
  blocks : 1540
  bytes  : 201371248
  files  :
    Segment 0 :
           /E=/IBM/Datasets/FILE_tst.ds.SERVER.0000.0001.0000.1ef8.d2081307.0001.9439a8ec  201850880 bytes
  total   : 201850880 bytes
Partition 2
  node   : node3
  records: 0
  blocks : 0
  bytes  : 0
  files  :
    Segment 0 :
           /E=/IBM/Datasets/FILE_tst.ds.SERVER.0000.0002.0000.1ef8.d2081307.0002.02fd46e8  0 bytes
  total   : 0 bytes
Partition 3
  node   : node4
  records: 2481638
  blocks : 9260
  bytes  : 1211039344
  files  :
    Segment 0 :
           /E=/IBM/Datasets/FILE_tst.ds.SERVER.0000.0003.0000.1ef8.d2081307.0003.9c5f3a3c  1213726720 bytes
  total   : 1213726720 bytes

Totals:
  records : 2904000
  blocks  : 10837
  bytes   : 1417152000
  filesize: 1420427264
  min part: 0
  max part: 1213726720

Schema:
record
(
xxxxxxx
)
##I IIS-DSEE-TFSC-00010 13:05:44(000) <main_program> Step execution finished with status = OK.


When i executed this job again , it loaded the data as above.

_________________
Thanks
DS User
----------------------------------
evidence driven decisions and performance driven outcomes
Rate this response:  
Not yet rated
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 49923
Location: Canberra, Australia
Points: 271338

Post Posted: Thu Aug 30, 2012 12:23 am Reply with quote    Back to top    

Tell us about the partitioning algorithms you've been using.

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong
Delivering Outcomes Together
Rate this response:  
Not yet rated
SURA



Group memberships:
Premium Members

Joined: 14 Jul 2007
Posts: 883

Points: 6066

Post Posted: Thu Aug 30, 2012 1:21 am Reply with quote    Back to top    

Hi Ray

Code:

                           SQLSrvr1                 SQLSrvr2       
                              |                             |
                              |                             |   
SRC_SQLSrvr-->TFM-->Join1------------------->Join2--------------->Dataset

SRC_SQLSrvr -- Auto Partition
TFM -- Auto
Join1 & SQLSrvr1 -- Hash , BK sort
Join2 & SQLSrvr2 -- SAME , BK sort
Dataset  -- SAME, BK Sort (Which is going to be the INPUT KEY for next job)


Code:

Source is SQL Server - AUTO
TFM                           - AUTO
First Join                    - Hash, SORT on BK for that JOIN
FIRST REF SQL SRVE  -  Hash, SORT on BK for that JOIN
2nd Join                     - SAME, SORT on BK for that JOIN
2nd SQL Srvr              - HASH, SORT on BK for that JOIN
TARGET DATASET       - SAME, SORT on BK for FOR NEXT JOB

In here BK for that JOIN mean each join stage will have different BKs.

_________________
Thanks
DS User
----------------------------------
evidence driven decisions and performance driven outcomes
Rate this response:  
Not yet rated
ShaneMuir



Group memberships:
Premium Members

Joined: 15 Jun 2004
Posts: 384
Location: London
Points: 3055

Post Posted: Thu Aug 30, 2012 1:42 am Reply with quote    Back to top    

Possibly a silly question but is the value that you are partitioning on the same for every record at the first join?
Rate this response:  
Not yet rated
SURA



Group memberships:
Premium Members

Joined: 14 Jul 2007
Posts: 883

Points: 6066

Post Posted: Thu Aug 30, 2012 6:44 am Reply with quote    Back to top    

For each Join, KEY columns are differing. But i am not re-partitioning, but in the SORT option I am using different columns.

Say for example, in JOIN1 my KEY column is A, whereas in JOIN2 my key column will be B.

For JOIN1, I am using HASH Partition, SORT by column A and in JOIN2 I am using the SAME Partition and SORT by COLUMN B.

_________________
Thanks
DS User
----------------------------------
evidence driven decisions and performance driven outcomes
Rate this response:  
Not yet rated
jwiles



Group memberships:
Premium Members

Joined: 14 Nov 2004
Posts: 1236

Points: 9942

Post Posted: Thu Aug 30, 2012 7:03 am Reply with quote    Back to top    

A key-based partitioning algorithm doesn't guarantee that data will be evenly distributed among the partitions or that data will be sent to all partitions. It merely guarantees that all records with containing the same (identical) value for a key will be sent to the same partition.

Do you have large numbers of records which have the same value for the key?

Is there a reason you are NOT partitioning by Column B for Join 2? It is your primary key for that join, after all Neutral

Regards,

_________________
- james wiles


All generalizations are false, including this one - Mark Twain.
Rate this response:  
Not yet rated
ulab
Participant



Joined: 16 Mar 2009
Posts: 36
Location: bangalore
Points: 224

Post Posted: Thu Aug 30, 2012 7:26 am Reply with quote    Back to top    

Right Said James,

and one more solution could be using a round robin partitioning Ofcourse it degrades the performance in this senario but it makes sure that all the rows are equally distuributed to all nodes, try this once,

IMP NOTE: if the issue is resolved please mark it as resolved this help others and saves time.

_________________
Ulab----------------------------------------------------
help, it helps you today or Tomorrow
Rate this response:  
Not yet rated
jwiles



Group memberships:
Premium Members

Joined: 14 Nov 2004
Posts: 1236

Points: 9942

Post Posted: Thu Aug 30, 2012 10:15 am Reply with quote    Back to top    

If a downstream job needs the data to be key-partitioned, then having 0 rows in one or more partitions of a dataset is not necessarily a bad situation. It's sometimes the price to pay at a point in the process to meet the business rule/logic requirements. If a downstream job doesn't require a particular partitioning scheme, then you can look at round-robin or random partitioning prior to writing out the dataset. Don't round-robin for the joins (I hope that wasn't your suggestion)--that would go against the requirements for the join stage.

Still, I question the use of SAME partitioning prior to the second join...WHY?? The framework could still be inserting a hash partition on column B at runtime to meet the requirement of the join key, when the correct partitioning should really be reflected in the job design instead.

Regards,

_________________
- james wiles


All generalizations are false, including this one - Mark Twain.
Rate this response:  
Not yet rated
SURA



Group memberships:
Premium Members

Joined: 14 Jul 2007
Posts: 883

Points: 6066

Post Posted: Sun Sep 02, 2012 7:21 pm Reply with quote    Back to top    

jwiles wrote:
Still, I question the use of SAME partitioning prior to the second join...WHY?? The framework could still be inserting a hash partition on column B at runtime to meet the requirement of the join key, when the correct partitioning should really be reflected in the job design instead.
Regards,

Thanks for the reply.

Do you mean, i need to use the HASH partition explicitly rather than SAME in the JOIN2 ?

When the chek the DUMP_SCORE, it is internally placing the HASH.

What is the difference between using SAME Vs explicitly choosing the partition?

_________________
Thanks
DS User
----------------------------------
evidence driven decisions and performance driven outcomes
Rate this response:  
Not yet rated
jwiles



Group memberships:
Premium Members

Joined: 14 Nov 2004
Posts: 1236

Points: 9942

Post Posted: Mon Sep 03, 2012 1:08 pm Reply with quote    Back to top    

Read this documentation about Same partitioning. You should then understand what Same partitioning does and how it differs from other partitioning.

Had partition insertion been disabled in your job or project, your second join would not have produced the results you required because your data would not have been partitioned correctly for that join operation.

Regards,

_________________
- james wiles


All generalizations are false, including this one - Mark Twain.
Rate this response:  
Not yet rated
SURA



Group memberships:
Premium Members

Joined: 14 Jul 2007
Posts: 883

Points: 6066

Post Posted: Wed Sep 05, 2012 8:03 pm Reply with quote    Back to top    

Sorry for the late reply.

Quote:
In theory YES I know how this will work, but I may confused &/ may not good enough understood the concept might be the reason!

In my view, first I made the Join1 as HASH partition and SORT the data for the SPECIFIC KEY for that JOIN1.

My understanding -->was<--, if I use the SAME PARTITION in the JON2, then DATASTAGE will do the HASH partition (which i chose in the previous join) based on the new KEY what i am going to choose for JOIN2.

Quote:
Same partitioner

The stage using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method. This is normally the method InfoSphere® DataStage® uses when passing data between stages in your job

Does it mean, If the KEY column is going to be same for the downstream STAGE, should use the SAME Partition?

_________________
Thanks
DS User
----------------------------------
evidence driven decisions and performance driven outcomes
Rate this response:  
Not yet rated
jwiles



Group memberships:
Premium Members

Joined: 14 Nov 2004
Posts: 1236

Points: 9942

Post Posted: Wed Sep 05, 2012 10:07 pm Reply with quote    Back to top    

You CAN use same partitioning, but you don't have to. You could leave it as Auto instead and DataStage would more than likely not repartition the data because it already meets the needs of the downstream stage.

Regards,

_________________
- james wiles


All generalizations are false, including this one - Mark Twain.
Rate this response:  
Not yet rated
SURA



Group memberships:
Premium Members

Joined: 14 Jul 2007
Posts: 883

Points: 6066

Post Posted: Wed Sep 05, 2012 10:57 pm Reply with quote    Back to top    

Thanks James. Now i understood.

But the thread was initiated to know, why data was not written in all the nodes?

Quote:
Hash partitioning does not necessarily result in an even distribution of data between partitions.

Dose it mean, it may not even write a single row in one more nodes?

_________________
Thanks
DS User
----------------------------------
evidence driven decisions and performance driven outcomes
Rate this response:  
Not yet rated
jwiles



Group memberships:
Premium Members

Joined: 14 Nov 2004
Posts: 1236

Points: 9942

Post Posted: Wed Sep 05, 2012 11:02 pm Reply with quote    Back to top    

This question was already answered earlier in the thread:

Quote:
Posted: Thu Aug 30, 2012 7:03 am
A key-based partitioning algorithm doesn't guarantee that data will be evenly distributed among the partitions or that data will be sent to all partitions. It merely guarantees that all records with containing the same (identical) value for a key will be sent to the same partition.

Regards,

_________________
- james wiles


All generalizations are false, including this one - Mark Twain.
Rate this response:  
Not yet rated
Display posts from previous:       

Add To Favorites
View next topic
View previous topic
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum



Powered by phpBB © 2001, 2002 phpBB Group
Theme & Graphics by Daz :: Portal by Smartor
All times are GMT - 6 Hours