Constant feed to DS Job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsdoubt
Participant
Posts: 106
Joined: Sat Jul 15, 2006 12:17 am

Constant feed to DS Job

Post by dsdoubt »

Hi,
My current project setup is as follows.
A set of files needs to be processed. Each file has only a small number of rows, but there is a large number of files.
So we have a set of Perl programs that preprocess the files, fetch them from different servers, and drop them on the DataStage box. As soon as the files are dropped, DataStage starts processing and works through all of the dropped files. Once this is done, another segment of DataStage is started. After all this, the initial Perl program fetches another set of files for DataStage.
This takes a lot of time.
Since the Perl program calls the DS jobs, it needs to maintain the list of file names and pass them as parameters, so the first stage cannot run while the second stage is going on.
Moreover, the startup time for each job run is around 5 seconds, whereas the actual processing takes just 2-3 seconds.
So is there a way to make DataStage wait, or listen to a port or directory at all times, and run the DS job as and when a file arrives? Would it work if we used named pipes?
I guess there is some functionality for this in Version 8, right?
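For illustration, the "listen to a directory" idea amounts to a polling watcher that claims arriving files atomically before handing them to a job. This is only a hedged sketch in Python; the directory paths are placeholders, and the actual job invocation (e.g. a `dsjob` command line) is deliberately left out because it depends on the installation:

```python
import glob
import os
import shutil
import time

LANDING = "/tmp/landing"    # where the Perl programs drop files (hypothetical path)
WORKDIR = "/tmp/inflight"   # files are moved here before processing

def watch_once():
    """One polling pass: claim any new files and return their new paths."""
    claimed = []
    for path in sorted(glob.glob(os.path.join(LANDING, "*.dat"))):
        dest = os.path.join(WORKDIR, os.path.basename(path))
        # A rename on the same filesystem is atomic, so the job never
        # sees a half-written file still being dropped by Perl.
        shutil.move(path, dest)
        claimed.append(dest)
    return claimed

def watch_forever(poll_seconds=5):
    """Loop forever, triggering work whenever files arrive."""
    while True:
        files = watch_once()
        for f in files:
            # Here one would trigger the DataStage job for this file;
            # the invocation details are omitted on purpose.
            print("would process", f)
        if not files:
            time.sleep(poll_seconds)
```

Because the watcher process stays resident, the per-run job startup cost is paid only when there is actually a file to process.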
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Using a named pipe is great, but you'll have to deal with timeout situations. Folks sometimes periodically send a "heartbeat" row down the pipe to keep it alive. I think this is too much of a gimmick, but that's my opinion.
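The heartbeat idea can be sketched with an ordinary POSIX named pipe. This is a minimal illustration only; the `HEARTBEAT` sentinel value and the one-row-per-line layout are assumptions for the example, not anything DataStage prescribes:

```python
import os
import threading
import time

HEARTBEAT = "HEARTBEAT\n"  # sentinel row; the value is an assumption for this sketch

def feed_with_heartbeat(fifo_path, rows, interval=0.05):
    """Writer side: send data rows, emitting a heartbeat row between them
    so the reader's pipe never looks dead during quiet periods."""
    with open(fifo_path, "w") as fifo:
        for row in rows:
            fifo.write(row + "\n")
            fifo.flush()
            fifo.write(HEARTBEAT)  # keep-alive between real rows
            fifo.flush()
            time.sleep(interval)

def consume(fifo_path):
    """Reader side: discard heartbeat rows, keep the real ones."""
    real = []
    with open(fifo_path) as fifo:
        for line in fifo:
            if line != HEARTBEAT:
                real.append(line.rstrip("\n"))
    return real
```

The reader blocks on the pipe while idle, which is exactly the "always listening" behaviour being asked about; the cost is the filtering step to throw the heartbeat rows away.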

I suppose concatenating the files is not an option? That would give a larger block of processing, lending more credence to the micro-batch approach.

I personally think a staging database helps out much better in these situations. You can be appending rows to the table as you're reading rows out. Rows can be updated with a status indicating whether they have been "inducted", "processed" or "rejected". Your micro-batches get larger because they can span multiple files (now just rows within the table). You gain a significant amount of functionality (retry, audit, elasticity in staging).
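The staging-table pattern described above can be sketched with SQLite standing in for the real staging database. The table name, columns, and status values are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

def make_staging(conn):
    # Every arriving row is inducted with a status; status is the
    # hand-off mechanism between the feeder and the processor.
    conn.execute("""CREATE TABLE staging (
        id INTEGER PRIMARY KEY,
        src_file TEXT,
        payload TEXT,
        status TEXT DEFAULT 'inducted')""")

def induct(conn, src_file, rows):
    """Feeder side: append rows from one file as they arrive."""
    conn.executemany(
        "INSERT INTO staging (src_file, payload) VALUES (?, ?)",
        [(src_file, r) for r in rows])

def take_batch(conn, limit=100):
    """Processor side: claim a micro-batch that may span many source files."""
    rows = conn.execute(
        "SELECT id, payload FROM staging WHERE status = 'inducted' LIMIT ?",
        (limit,)).fetchall()
    conn.executemany(
        "UPDATE staging SET status = 'processed' WHERE id = ?",
        [(r[0],) for r in rows])
    return rows
```

Because the batch is drawn by status rather than by file name, the feeder and processor no longer need to coordinate on a file list, which is the coupling described in the original question.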
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
dsdoubt
Participant
Posts: 106
Joined: Sat Jul 15, 2006 12:17 am

Post by dsdoubt »

Thanks for the reply.
Is it a dummy row that you are referring to as the heartbeat? So we should reject that row with some conditional check, shouldn't we?
And if we have a named pipe and make the job listen to that pipe, will the job always be running, even when no data is available in the pipe?
Because the data from the previous stage may not be accumulating at all times; it will only be available at certain spans of time.
That is, the files will arrive between morning and evening at intervals. Each file will be processed by each job within seconds. After that, the job will be idle.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You get the idea... :D
Kenneth Bland

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You might consider using server jobs, or parallel jobs with a low degree of parallelism, to keep the startup time as short as possible.

Another possibility is an "always running" job using WISD to publish your DataStage job as a web service that the Perl application could invoke. In this case startup time would not be an issue.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dsdoubt
Participant
Posts: 106
Joined: Sat Jul 15, 2006 12:17 am

Post by dsdoubt »

Is that part of Version 8?
Is there any documentation available regarding the functionality and performance boost we would get if we upgrade to V8?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What makes you think that you'll get a performance boost? Indeed, how do you define "performance" in such a vague context?

You certainly get a functionality boost - quite a few new toys and a common Repository that can be shared (to some extent) with other tools. Therefore arguably there is a potential boost to developer productivity to be had - does this count as "performance"?