Constant feed to DS Job

dsdoubt · Post by **dsdoubt** » Fri Feb 29, 2008 12:52 pm

Hi,
My current project setup is as follows.
A set of files needs to be processed. Each file has only a small number of rows, but there are a large number of files.
So we have a set of Perl programs that preprocess and get the files from different servers and drop them on the DataStage box. As soon as the files are dropped, DataStage starts processing and finishes all the dropped files. Once this is done, another segment of DataStage jobs is started. After all this, the initial Perl program gets another set of files for DataStage.
This takes a lot of time.
Since the Perl program calls the DS jobs, it needs to maintain the list of file names and pass them as parameters. So we can't run the first stage while the second stage is going on.
Moreover, the startup time for each job is around 5 seconds each time, whereas production is just 2-3 seconds.
So is there a way to make DataStage wait or listen on a port or directory all the time, and run the DS job as and when a file arrives? Would it work if we used the named pipe option?
I guess there is some functionality for this in Version 8, right?

kcbland · Post by **kcbland** » Fri Feb 29, 2008 1:14 pm

Using a named pipe is great, but you'll have to deal with timeout situations. Folks sometimes periodically send a "heartbeat" row to a pipe to keep it alive. I think this is too much of a gimmick, but that's my opinion.
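
For illustration only, here is a minimal sketch of the heartbeat idea in Python rather than in a DataStage job, assuming a POSIX system. The writer periodically puts a sentinel row on the pipe and the reader drops it with a conditional check. The marker value `__HEARTBEAT__` and all function names are hypothetical, not anything DataStage provides.

```python
import os

# Hypothetical sentinel; choose a value that can never occur in real data.
HEARTBEAT = "__HEARTBEAT__"


def ensure_pipe(pipe_path):
    """Create the named pipe if it does not exist yet (POSIX only)."""
    if not os.path.exists(pipe_path):
        os.mkfifo(pipe_path)


def is_heartbeat(row):
    """Return True for the dummy keep-alive row."""
    return row.strip() == HEARTBEAT


def data_rows(lines):
    """Yield only real data rows, dropping heartbeats and blank lines."""
    for line in lines:
        row = line.rstrip("\n")
        if row and not is_heartbeat(row):
            yield row


def send_heartbeat(pipe_path):
    """Write one heartbeat row; open() blocks until a reader has the pipe open."""
    with open(pipe_path, "w") as pipe:
        pipe.write(HEARTBEAT + "\n")


def read_pipe(pipe_path):
    """Yield data rows from the pipe until every writer has closed it."""
    with open(pipe_path) as pipe:
        yield from data_rows(pipe)
```

In a job, the same check would be a constraint that rejects the heartbeat row before it reaches the target.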

I suppose concatenation of files is not an option? This would give a larger block of processing to give more credence to the micro-batch approach.
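
A rough sketch of what concatenation could look like on the preprocessing side, written in Python for illustration (the existing scripts are Perl). It assumes every file has the same layout, no header rows, and ends with a newline. The function name is hypothetical.

```python
import shutil


def concatenate_files(input_paths, batch_path):
    """Append each small file to one batch file, in the order given.

    Returns the number of files that went into the batch.
    """
    count = 0
    with open(batch_path, "wb") as batch:
        for path in input_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, batch)
            count += 1
    return count
```

The job is then started once per batch file instead of once per input file, so the startup cost is paid less often.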

I personally think a staging database helps out in these situations much better. You can be appending rows to the table as you're reading rows out. Rows can be updated with a status indicating they've been "inducted", "processed" or "rejected". Your micro-batches get larger because they can span multiple files (now just rows within the table). You gain a significant amount of functionality (retry, audit, elasticity in staging).
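
A minimal sketch of such a staging table, using Python and SQLite purely for illustration. The table name, column names and functions are assumptions, and it assumes a single reader, so there is no locking between competing consumers.

```python
import sqlite3


def create_staging(conn):
    """Create the staging table; new rows default to status 'inducted'."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS staging (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            source_file TEXT NOT NULL,
            payload     TEXT NOT NULL,
            status      TEXT NOT NULL DEFAULT 'inducted'
        )
        """
    )


def induct(conn, source_file, rows):
    """Loader side: append the rows of one incoming file."""
    conn.executemany(
        "INSERT INTO staging (source_file, payload) VALUES (?, ?)",
        [(source_file, row) for row in rows],
    )
    conn.commit()


def next_batch(conn, limit=1000):
    """Reader side: fetch waiting rows, which may span several files."""
    cur = conn.execute(
        "SELECT id, payload FROM staging "
        "WHERE status = 'inducted' ORDER BY id LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


def mark(conn, ids, status):
    """Record the outcome ('processed' or 'rejected') for the given row ids."""
    conn.executemany(
        "UPDATE staging SET status = ? WHERE id = ?",
        [(status, row_id) for row_id in ids],
    )
    conn.commit()
```

The loader can keep calling `induct` while the reader loops over `next_batch` and `mark`, and the status column gives you the audit trail and a way to retry rejected rows.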

dsdoubt · Post by **dsdoubt** » Fri Feb 29, 2008 1:26 pm

Thanks for the reply.
Is the heartbeat you're referring to just a dummy row? So we should reject that row with some conditional check, shouldn't we?
So if we have a named pipe and make the job listen to that pipe, will the job always be running, even when no data is available in the pipe?
Because the data from the previous stage may be accumulated at any time; it will become available over a span of time.
For example, the files will be available from morning to evening at some interval. Each file will be processed by the job within seconds. After that, the job will be idle.

kcbland · Post by **kcbland** » Fri Feb 29, 2008 1:51 pm

You get the idea... :D

ray.wurlod · Post by **ray.wurlod** » Fri Feb 29, 2008 3:48 pm

You might consider using server jobs, or parallel jobs with a low degree of parallelism, to keep the startup time as short as possible.

Another possibility is an "always running" job using WISD to publish your DataStage job as a web service that the Perl application could invoke. In this case startup time would not be an issue.

dsdoubt · Post by **dsdoubt** » Fri Feb 29, 2008 11:09 pm

Is it part of Version 8?
Is there any documentation available regarding the functionality and performance boost that we would get in V8 if we upgrade?

ray.wurlod · Post by **ray.wurlod** » Sat Mar 01, 2008 2:46 am

What makes you think that you'll get a performance boost? Indeed, how do you define "performance" in such a vague context?

You certainly get a functionality boost - quite a few new toys and a common Repository that can be shared (to some extent) with other tools. Therefore arguably there is a potential boost to developer productivity to be had - does this count as "performance"?