RCP capability in a server shared container

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

RCP capability in a server shared container

Post by UAUITSBI »

Hello,

I am building a server shared container with a hash file in it. I will be using this container in my parallel jobs to load the hash file and re-use this container as a source in other parallel jobs.

Metadata is different in each of the parallel jobs that I intend to use this container in. I will use a filename parameter to create these different hash files, but I am not sure if I can load the hash file with metadata that is dynamic.

In parallel jobs/containers we have the capability of Run-Time Column Propagation; is there a way I can accomplish this in server shared containers?

Note: I could use a dataset instead of the server shared container, but we are trying to take advantage of the caching capabilities of the hash file.

Any help is appreciated.

Thanks!
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sorry but... can we start with what "the cache capabilities of the hashed file" means in this context? Don't get me wrong, I cut my teeth on and loves me some hashed files but curious what - as a source and target - you think it gets you over a PX dataset in your PX job?
-craig

"You can never have too many knives" -- Logan Nine Fingers
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

Hi chulett,
We have jobs in SAP Data Services that we are trying to convert into parallel jobs in DataStage. The SAP jobs write to Persistent Cache tables, which store the data in memory rather than in a physical table and are used as a source in other jobs.

We are trying to replicate a similar process in DataStage: we are sourcing from different tables and want to load the data into a stage that provides caching capabilities. I couldn't find one in parallel jobs, which is what got me thinking about the hash file and server shared containers. A dataset is indeed efficient and could be used as a "staging area", but because it doesn't offer anything for "caching the data" we want to use hash files instead.

Let me know if I am not clear enough.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm suspecting the caching you are thinking of only applies to hashed files used as a lookup, not as a source or target. Now, technically, the write to a hashed file can be cached, but you still have to wait for the cache to flush before the job ends, and there's no read caching that I recall outside of a lookup, so I'm still not seeing anything it would buy you here.

Best to stick with datasets, IMHO.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

RCP is a parallel feature, so I doubt you'll ever get it to work with a server shared container.

I'd stick with the parallel dataset.

Someone was asking a while back about using a ramdisk for the scratchdisk resource. I don't see why that wouldn't also be a viable option for the resource disk that parallel datasets use.
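
For illustration only, a minimal sketch of what such a configuration file entry might look like, assuming a tmpfs/ramdisk mount; the host name, paths and pool names below are placeholders, not taken from this thread:

    {
        node "node1"
        {
            fastname "etlhost"
            pools ""
            resource disk "/mnt/ramdisk/datasets" {pools ""}
            resource scratchdisk "/mnt/ramdisk/scratch" {pools ""}
        }
    }

Point APT_CONFIG_FILE (or the project default configuration) at a file like this and any datasets written under that resource disk would live on memory-backed storage, with the obvious caveat that the contents vanish on reboot.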

Mike
thompsonp
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

Are you trying to replicate the functionality of a suite of SAP jobs or of each individual SAP job?

Perhaps you would be better off taking a step back and considering how these cached files are being used in other SAP jobs. That might point to a variety of options in DataStage, each of which may be suited to different circumstances.

For example, in SAP, where is the cached data coming from - files, a database, a message queue?
How is it used in different jobs - as the main source data, joined with other data, as reference data?
Depending on your answers to these questions you could find that there's no need to "cache" (perhaps persist is a more appropriate term) data because you can read it fast enough from source each time (especially in parallel), or that a dataset is a good option, or a lookup file set.
Once you look across several jobs you may find you can consolidate into a single DataStage job and remove the need to temporarily persist the data anyway.

+1 for sticking with parallel stages in a parallel environment. Parameterise the dataset name and use RCP.
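
As a hedged illustration of that suggestion (the parameter and column names here are invented for the example, not taken from the thread), the Data Set stage's file name can be built from job parameters, something like

    #pDatasetDir#/#pSourceTable#.ds

and with Runtime Column Propagation enabled on the link, downstream parallel jobs reading that dataset pick up whatever columns were written to it, because a dataset's descriptor carries its own record schema. Where the original source is a flat file rather than a table, a schema file can supply the column definitions at run time instead of hard-coding them in the job, along these lines:

    record
    {final_delim=end, delim=',', quote=double}
    (
        customer_id: int32;
        customer_name: string[max=50];
        amount: nullable decimal[10,2];
    )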

Also bear in mind that depending on the use of these files / datasets or whatever you end up with, the o/s could end up caching the data in memory anyway.
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

chulett wrote:I'm suspecting the caching you are thinking of only applies to hashed files used as a lookup, not as a source or target.
Ah, okay. I always thought that once the data is cached, whenever we read from it (either as a source or a reference) the data would be read from the cache. Now that I think about it you might be right, as the read from cache works effectively when joining on keys rather than for a straight pull.

Sorry I got to this late as we were heading towards the deadline. We've decided to go with temporary tables, which will be a truncate and reload.
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

Mike,

Yes, RCP is strictly a parallel feature. I was hoping there were other techniques we could use to push the columns through to the hash file dynamically, but I guess there are none.

I agree, parallel datasets are much more effective in this case, but as weird as it sounds my client is picky about datasets; they grumble about how difficult they are to maintain (deleting them entirely from the file system, etc.). I gave up that battle trying to convince them.
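
For what it's worth, much of that maintenance pain usually comes from deleting the .ds descriptor file directly with rm, which leaves the data segment files orphaned on the resource disks. Assuming the orchadmin command-line utility is available on the engine tier (the path below is only a placeholder), a dataset and all of its segment files can be removed in one go with something like

    orchadmin rm /staging/datasets/customer_dim.ds

or interactively through the Data Set Management tool in the Designer client.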

Thanks for your input.
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

thompsonp:

I am trying to implement the functionality of each individual SAP job; the data is being loaded into Persistent Cache tables which are later used as references in other data flows.

We are merely implementing the logic that is in SAP DSXi on DataStage, per client request, so that the jobs are easier for them to understand, since the SAP process has been in their realm for a long time. So we are not able to consolidate any jobs; if we could, then yes, the cache could eventually be handled in a lot of the other ways you mentioned.

Thanks for the input.