Lookup character strings

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

I think earlier I did not understand your requirement if full.

If you need to identify and split into different sets, maybe you can create a temporary pattern file copy as

Code: Select all

sed -e 's/^/\^/' yourPatternFile > tempPatternFile
sed -e 's/^/ /' -e 's/$/ /' yourPatternFile >> tempPatternFile
sed -e 's/$/ /' yourPatternFile >> tempPatternFile
And then do

Code: Select all

egrep -f tempPatternFile yourSourceDataFile
Alternatively you can split the incoming source data rows by replacing all "blank" with "blank + newline + blank" and also add @INROWNUM & trim(blank-de-blank) column to it.

Thus the line
This is a test line
will become
1 Thisb This
1 bisb is
1 bab a
1 btestb test
1 bline line

where b is blank character
You can then do you pattern matching using lookup with the third column.

This way, you get the line number which relates to the row, the original data as it appears in the row and the trimmed data to match. This will give you all analysis.
just4geeks
Premium Member
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Post by just4geeks »

s_boyapati wrote:In DataStage lookup stage........
Another way is after lookup,.......
Thanks for your suggestions. But how will the lookup stage pull out these records in the first place? For example, if the field on the input stream is
'adfa ad are ab adr' and the lookup string is 'ab', the lookup stage will not pull it out since it doesn't match perfectly. Further, I cannot split the field on the input stream into sub-fields as the number of white spaces vary between records.
Attitude is everything....
s_boyapati
Premium Member
Premium Member
Posts: 70
Joined: Thu Aug 14, 2003 6:24 am
Contact:

Re: Lookup character strings

Post by s_boyapati »

just4geeks wrote:I have a fixed width flat file containing a field with character data type, such as,

Code: Select all

ab dfa adfa
adie ab dsf
dfe adf abe
I have a lookup flat file which contains just one character field, such as,

Code: Select all

ab
do
so
I need to pull out those records which contain strings from lookup file. And I need to be able to specify if it can be present at the start, in the middle, or at the end. For example, consider the string, 'ab'. If I need to pull out records where ab is present at the start, I now do

Code: Select all

grep '^ab' <file>
. If ab is to be present in the middle, I now do

Code: Select all

grep ' ab ' <file>
. These grep commands have worked in the past because the lookup file was small. Now it was grown to over 18000 records.

How do I implement this in DataStage? The existing lookup stage only looks for exact matches of the lookup field value. But that will not get my expected results.

Any ideas will be appreciated.
In DataStage lookup stage you can do equijoin only, Use EXEC in After job routine, that should call shell script with file name as parameter to shell script. In Shell script use AWK or PERL skills to use first field ( Search key) to be searched in second fileld(Input record). Use PERL instead of AWK, if you like and comfortability.

Another way (Probably I am not sure about, depends on reqs ), use transformer where use Substring and index functions to find where the matching field located in string. You must use staging variables to get accurate results.

Hope this helps...
Sree Boyapati
Sr. ETL Architect
Certified Developer in DataStage, QualityStage, Information Analyzer.
Post Reply