Job using dataset files is slower than sequential files

Posted: Wed Feb 07, 2007 11:43 am
by splayer
I created 2 sets of jobs. The two sets are identical, and both have loops. Here is the loop code:

StartLoop --> ExecCmd1 --> JobActivity --> ExecCmd2 --> EndLoop

There is another link from the EndLoop to StartLoop. The job in the JobActivity stage is:
SeqFile --> Modify --> SurrogateKeyGenerator --> Transformer --> TargetFile

The difference between the 2 sets is that the TargetFile is a sequential file in one and data set file in the other. The set of jobs with sequential file is significantly faster than the set with data set files. I would think that just the reverse should be true. Is there any config file manipulation that I can do?

Posted: Wed Feb 07, 2007 12:24 pm
by us1aslam1us
Use job monitoring to check whether the move from the Transformer to the target file (the data set) is taking more time, or whether the overall process is slower. Is it running in sequential mode?

Posted: Wed Feb 07, 2007 12:49 pm
by patonp

Posted: Wed Feb 07, 2007 10:49 pm
by splayer
I did job monitoring. There is nothing specific that I can find there. I tried changing the config file from a 2 node file to a 4 node file. It does not split the transformer processing across 4 nodes, which is kind of strange.

Posted: Wed Feb 07, 2007 11:07 pm
by kumar_s
Is the environment idle in both cases? Your server might be loaded in the latter case.
It is also possible that the additional node is not easily accessible for the dataset to write its data to. As a test, try running with a single node, the one where the sequential file is created.

Posted: Thu Feb 08, 2007 12:27 am
by splayer
kumar_s, can you elaborate a little bit? What does "environment idle" mean? Mine is a dev environment and I have just one box but 4 processors, from what I know. If I use 4 nodes, shouldn't I see 4 instances for the transformer in job monitor window?

Posted: Thu Feb 08, 2007 12:32 am
by balajisr
Are you by any chance running the transformer in sequential mode?

Posted: Thu Feb 08, 2007 12:48 am
by kumar_s
Your server might be busy with other work when you test with the dataset and comparatively idle when you process the sequential file. That would make the dataset job run slower. You can measure the CPU usage in both cases.

Posted: Thu Feb 08, 2007 3:55 am
by ArndW
As balajisr mentioned, you might be running sequentially instead of in parallel. Turn on APT_DUMP_SCORE to see what is really happening at runtime. A dataset, even in sequential mode, should run at about the same speed as a sequential file for what you've described, so something is definitely amiss.

Posted: Thu Feb 08, 2007 10:20 am
by splayer
No, I am running in parallel as I am seeing xfm x 2 in the monitor where xfm is the transformer. I set APT_DUMP_SCORE to True. I don't see anything additional in the log. Shouldn't I see the output in the log?

Posted: Thu Feb 08, 2007 1:26 pm
by splayer
These are the outputs from the dump score for the sequential and dataset file versions. For 20 source files, the sequential file version takes 48 secs and the dataset file version takes 68 secs.

Sequential file version (2 nodes):

main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_CombinedOperatorController:sk_Add_SrcID)}
ds1: {op1[2p] (parallel APT_CombinedOperatorController:APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
>>eCollectAny
op2[1p] (sequential APT_RealFileExportOperator in MasterFile)}
It has 3 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(sk_Add_SrcID)
(APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[1p] {(sequential APT_RealFileExportOperator in MasterFile)
on nodes (
node2[op2,p0]
)}
It runs 4 processes on 2 nodes.


Dataset file version (2 nodes):

main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)}
ds1: {op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
=>
/Fld1/Fld2/Fld3/MyDS.ds}
It has 2 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
It runs 3 processes on 2 nodes.
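
To compare two scores like these side by side, the operator and process counts can be pulled out of the dump text with a small script. This is only a rough sketch: the helper name summarize_score is made up for illustration, and the regular expressions assume the score looks exactly like the text pasted above (operator lines of the form "opN[Kp] {(sequential|parallel ..." and a closing "It runs N processes on M nodes." line). Other DataStage versions may format the score differently.

```python
import re

# Dataset-version score text, copied from the post above.
DATASET_SCORE = """\
main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)}
ds1: {op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
=>
/Fld1/Fld2/Fld3/MyDS.ds}
It has 2 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
It runs 3 processes on 2 nodes.
"""


def summarize_score(score_text):
    """Hypothetical helper: extract operators, process count and node count."""
    operators = {}
    # Operator definitions start a line and open with "{(", unlike the
    # operator mentions inside the dsN dataset descriptions.
    pattern = r"^(op\d+)\[(\d+)p\] \{\((sequential|parallel)"
    for match in re.finditer(pattern, score_text, re.MULTILINE):
        name, partitions, mode = match.groups()
        operators[name] = (mode, int(partitions))

    totals = re.search(r"It runs (\d+) processes on (\d+) nodes", score_text)
    processes = int(totals.group(1)) if totals else None
    nodes = int(totals.group(2)) if totals else None
    return {"operators": operators, "processes": processes, "nodes": nodes}


summary = summarize_score(DATASET_SCORE)
print(summary)
```

Running the same helper on the sequential file score should report three operators and four processes, the extra one being the sequential export operator (op2) that writes MasterFile.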

Posted: Thu Feb 08, 2007 4:15 pm
by ArndW
What happens to the speed of the sequential version if you output it to the same directory as your dataset data files as specified in the APT_CONFIG file? If it slows down then it might be related to the disk partition and not directly to the DS job.

Posted: Thu Feb 08, 2007 4:27 pm
by splayer
My APT_CONFIG_FILE is located in /home/dsadm/Ascential/DataStage/Configurations. Both versions output the file to the same folder.

This does not make sense to me. I would think that performance would at least be the same. I am having doubts about the need for datasets now. I am not seeing any benefit other than being able to store larger files on my multiple disks.

Posted: Thu Feb 08, 2007 11:06 pm
by balajisr
What is your partitioning type when you load into dataset?

Posted: Thu Feb 08, 2007 11:16 pm
by kumar_s
Arnd, even in sequential mode, shouldn't a dataset be quicker than a sequential file, at least theoretically? A dataset is written in native format, so there is no need to convert to ASCII.
splayer, moreover, a benchmark on a workload that is processed within a few seconds will not give an exact result. Check the startup time and production time for each case, because these will be in terms of seconds.
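
The startup-time point can be put into numbers using the timings reported earlier in the thread (48 secs vs 68 secs for 20 source files). The sketch below assumes the loop runs the job once per source file, which the thread does not state outright; the function name is made up for illustration.

```python
# Timings reported earlier in the thread.
RUNS = 20            # assumption: one job run per source file
SEQ_TOTAL_SECS = 48  # sequential file version, all runs
DS_TOTAL_SECS = 68   # dataset file version, all runs


def extra_seconds_per_run(slow_total, fast_total, runs):
    """Average extra wall-clock time per job run for the slower variant."""
    return (slow_total - fast_total) / runs


extra = extra_seconds_per_run(DS_TOTAL_SECS, SEQ_TOTAL_SECS, RUNS)
print(f"sequential: {SEQ_TOTAL_SECS / RUNS:.1f} s per run")
print(f"dataset:    {DS_TOTAL_SECS / RUNS:.1f} s per run")
print(f"difference: {extra:.1f} s per run")
```

Under that assumption the gap is about one second per run (roughly 2.4 s vs 3.4 s), which is small enough to be startup overhead and not row throughput. A test with a much larger input would separate the two.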