Lookup character strings

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

just4geeks
Premium Member
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Lookup character strings

Post by just4geeks »

I have a fixed width flat file containing a field with character data type, such as,

Code: Select all

ab dfa adfa
adie ab dsf
dfe adf abe
I have a lookup flat file which contains just one character field, such as,

Code: Select all

ab
do
so
I need to pull out those records which contain strings from lookup file. And I need to be able to specify if it can be present at the start, in the middle, or at the end. For example, consider the string, 'ab'. If I need to pull out records where ab is present at the start, I now do

Code: Select all

grep '^ab' <file>
. If ab is to be present in the middle, I now do

Code: Select all

grep ' ab ' <file>
. These grep commands have worked in the past because the lookup file was small. Now it was grown to over 18000 records.

How do I implement this in DataStage? The existing lookup stage only looks for exact matches of the lookup field value. But that will not get my expected results.

Any ideas will be appreciated.
Attitude is everything....
Rubu
Premium Member
Premium Member
Posts: 82
Joined: Sun Feb 27, 2005 9:09 pm
Location: Bangalore

Re: Lookup character strings

Post by Rubu »

if i have understood your question correctly, you are trying to pull out data from the first file depending on values in lookup file (e.g. 'ab').

how many columns the first file has. if its limited (3,4) you can have a join stage joining both the files with equating lookup file column with all the columns of the first file and connecting them with OR condition.

Next you can use a remove duplicate state to remove the dups arising due to multiple different look strings present in same row.
Regards
Palas
just4geeks
Premium Member
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Re: Lookup character strings

Post by just4geeks »

Rubu wrote:if i have understood your question correctly, you are trying to pull out data from the first file depending on values in lookup file (e.g. 'ab').

how many columns the first file has. if its limited (3,4) you can have a join stage joining both the files with equating lookup file column with all the columns of the first file and connecting them with OR condition.

Next you can use a remove duplicate state to remove the dups arising due to multiple different look strings present in same row.
You have understood the question correctly. Irrespective of the columns present in the first file, I need to only lookup the first column in the file. I cannot use a join stage since it looks for a perfect match. If the first file has "ab dfa adfa" and the lookup file has "ab", it will not join it. However, I do need that record, "ab dfa adfa" pulled out, since "ab" is present in it.

Let me know if you have any ideas.
Attitude is everything....
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Re: Lookup character strings

Post by Kryt0n »

just4geeks wrote: These grep commands have worked in the past because the lookup file was small. Now it was grown to over 18000 records.
And grep takes a long time on 18000 rows?? I would be surprised if DataStage can process that many rows faster than grep considering it will need a couple of seconds for startup/shutdown. (Or are you simplifying your requirements?)

As for implementing in DS, does your input always have a fixed number of "words" (i.e. columns separated by space)?
just4geeks
Premium Member
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Re: Lookup character strings

Post by just4geeks »

Kryt0n wrote:And grep takes a long time on 18000 rows?? I would be surprised if DataStage can process that many rows faster than grep considering it will need a couple of seconds for startup/shutdown. (Or are you simplifying your requirements?)
I am simplifying my requirements and am not sure how big can the lookup file can get.
Kryt0n wrote:As for implementing in DS, does your input always have a fixed number of "words" (i.e. columns separated by space)?
No, my Input doesn't always have a fixed number of words, though it has a fixed length.
Attitude is everything....
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

Sorry for piecemeal ideas, trying to think up ideas for your solution while working on my own...

What's running through my mind is to pivot on this column, run through the lookup doing whatever needs to be done on those that succeed in the lookup. Would then need a dedup at the end to remove any that succeeded in the lookup more than once (if that is possible).

For the start, middle, end requirement, either split the first and last "word" from the rest, and pivot on the middle section. Or after the lookup, check if the lookup value equals the first or last "word" of your starting value. (Although this will mess up the dedup a touch so can't leave the drawing board just yet)

If a row starts abc, should that successfully lookup on your ^ab requirement (or is it ^ab<space>)?
just4geeks
Premium Member
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Post by just4geeks »

Kryt0n wrote:What's running through my mind is to pivot on this column....
I am not sure if I understand you right. It is my understanding that the Pivot stage does something as follows.
Input File:

Code: Select all

a a1 a2 a3
b b1 b2 b3
Output File after Pivoting:

Code: Select all

a a1
a a2
a a3
a a4
b b1
b b2
b b3
b b4
Can you please elaborate on Pivoting a bit more? How do I use Pivot in my problem?
Kryt0n wrote:If a row starts abc, should that successfully lookup on your ^ab requirement (or is it ^ab<space>)?
If the row starts abc, it should successfully lookup on ^ab, just as grep would do.[/quote][/code][/quote]
Attitude is everything....
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I think this one is best done in the database. It's a complex, non-equality join requirement (potentially with three LIKE conditions with OR conjunctions), something DataStage is not really equipped to do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
just4geeks
Premium Member
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Post by just4geeks »

ray.wurlod wrote:I think this one is best done in the database.
Thanks Ray. I value your opinion. Unfortunately, our entire process is based on flat files, since we do a lot of UNIX command processing. I guess, we will stick to grep.
Attitude is everything....
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

Would have to agree there, particularly if you are trying to do LIKE conditions.

As for previous post, the idea was to break your input column (used in the lookup) in to rows, one for each word within the column. You can then lookup on each individually.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

Would have to agree there, particularly if you are trying to do LIKE conditions.

As for previous post, the idea was to break your input column (used in the lookup) in to rows, one for each word within the column. You can then lookup on each individually.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Try the ODBC driver for text files. (It might be horribly slow, however.)

Another possibility might be to load your text file data into UniVerse tables and effect the complex join there. (Again this will be slow, because indexes can not assist LIKE joins.) However, you can create I-descriptors (virtual columns) such as the first two characters or the last two characters, and these can be indexed and would assist those kinds of search.

Code: Select all

WHERE SUBSTRING(TableA.column1 FROM 1 FOR 2) = TableB.LeftmostTwoCharacters
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Code: Select all

sed -e 's/^/ /' -e 's/$/ /' sourceDataFile | egrep -f patternFile
Note - prefix and suffix each char in pattern file with space. Or make another sed to do it.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Code: Select all

sed -e 's/^/ /' -e 's/$/ /' sourceDataFile | egrep -f patternFile
Note - prefix and suffix each char in pattern file with space. Or make another sed to do it.

I assume you do not want the "replacement word".
just4geeks
Premium Member
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Post by just4geeks »

Sainath.Srinivasan wrote:

Code: Select all

sed -e 's/^/ /' -e 's/$/ /' sourceDataFile | egrep -f patternFile
I am not sure I understand how can this help me. Can you please elaborate a bit? If I run the above command, a search for '^ab' will yield nothing.
How can I use the same pattern file for different searches, such as,

Code: Select all

grep '^ab'
grep ' ab '
grep 'ab$'
I need to keep the results of these different searches separate.
Attitude is everything....
Post Reply