Lookup character strings

just4geeks · Post by **just4geeks** » Sat Mar 07, 2009 11:11 pm

I have a fixed width flat file containing a field with character data type, such as,

ab dfa adfa
adie ab dsf
dfe adf abe

I have a lookup flat file which contains just one character field, such as,

Code: Select all

ab
do
so

I need to pull out those records which contain strings from lookup file. And I need to be able to specify if it can be present at the start, in the middle, or at the end. For example, consider the string, 'ab'. If I need to pull out records where ab is present at the start, I now do

Code: Select all

grep '^ab' <file>

. If ab is to be present in the middle, I now do

Code: Select all

grep ' ab ' <file>

. These grep commands have worked in the past because the lookup file was small. Now it was grown to over 18000 records.

How do I implement this in DataStage? The existing lookup stage only looks for exact matches of the lookup field value. But that will not get my expected results.

Any ideas will be appreciated.

Rubu · Post by **Rubu** » Sun Mar 08, 2009 4:21 am

if i have understood your question correctly, you are trying to pull out data from the first file depending on values in lookup file (e.g. 'ab').

how many columns the first file has. if its limited (3,4) you can have a join stage joining both the files with equating lookup file column with all the columns of the first file and connecting them with OR condition.

Next you can use a remove duplicate state to remove the dups arising due to multiple different look strings present in same row.

just4geeks · Post by **just4geeks** » Sun Mar 08, 2009 6:51 am

Rubu wrote:if i have understood your question correctly, you are trying to pull out data from the first file depending on values in lookup file (e.g. 'ab').

how many columns the first file has. if its limited (3,4) you can have a join stage joining both the files with equating lookup file column with all the columns of the first file and connecting them with OR condition.

Next you can use a remove duplicate state to remove the dups arising due to multiple different look strings present in same row.

You have understood the question correctly. Irrespective of the columns present in the first file, I need to only lookup the first column in the file. I cannot use a join stage since it looks for a perfect match. If the first file has "ab dfa adfa" and the lookup file has "ab", it will not join it. However, I do need that record, "ab dfa adfa" pulled out, since "ab" is present in it.

Let me know if you have any ideas.

Kryt0n · Post by **Kryt0n** » Sun Mar 08, 2009 4:57 pm

just4geeks wrote: These grep commands have worked in the past because the lookup file was small. Now it was grown to over 18000 records.

And grep takes a long time on 18000 rows?? I would be surprised if DataStage can process that many rows faster than grep considering it will need a couple of seconds for startup/shutdown. (Or are you simplifying your requirements?)

As for implementing in DS, does your input always have a fixed number of "words" (i.e. columns separated by space)?

just4geeks · Post by **just4geeks** » Sun Mar 08, 2009 5:05 pm

Kryt0n wrote:And grep takes a long time on 18000 rows?? I would be surprised if DataStage can process that many rows faster than grep considering it will need a couple of seconds for startup/shutdown. (Or are you simplifying your requirements?)

I am simplifying my requirements and am not sure how big can the lookup file can get.

Kryt0n wrote:As for implementing in DS, does your input always have a fixed number of "words" (i.e. columns separated by space)?

No, my Input doesn't always have a fixed number of words, though it has a fixed length.

Kryt0n · Post by **Kryt0n** » Sun Mar 08, 2009 5:46 pm

Sorry for piecemeal ideas, trying to think up ideas for your solution while working on my own...

What's running through my mind is to pivot on this column, run through the lookup doing whatever needs to be done on those that succeed in the lookup. Would then need a dedup at the end to remove any that succeeded in the lookup more than once (if that is possible).

For the start, middle, end requirement, either split the first and last "word" from the rest, and pivot on the middle section. Or after the lookup, check if the lookup value equals the first or last "word" of your starting value. (Although this will mess up the dedup a touch so can't leave the drawing board just yet)

If a row starts abc, should that successfully lookup on your ^ab requirement (or is it ^ab<space>)?

just4geeks · Post by **just4geeks** » Sun Mar 08, 2009 6:00 pm

Kryt0n wrote:What's running through my mind is to pivot on this column....

I am not sure if I understand you right. It is my understanding that the Pivot stage does something as follows.
Input File:

Code: Select all

a a1 a2 a3
b b1 b2 b3

Output File after Pivoting:

Code: Select all

a a1
a a2
a a3
a a4
b b1
b b2
b b3
b b4

Can you please elaborate on Pivoting a bit more? How do I use Pivot in my problem?

Kryt0n wrote:If a row starts abc, should that successfully lookup on your ^ab requirement (or is it ^ab<space>)?

If the row starts abc, it should successfully lookup on ^ab, just as grep would do.[/quote][/code][/quote]

ray.wurlod · Post by **ray.wurlod** » Sun Mar 08, 2009 6:24 pm

I think this one is best done in the database. It's a complex, non-equality join requirement (potentially with three LIKE conditions with OR conjunctions), something DataStage is not really equipped to do.

just4geeks · Post by **just4geeks** » Sun Mar 08, 2009 7:20 pm

ray.wurlod wrote:I think this one is best done in the database.

Thanks Ray. I value your opinion. Unfortunately, our entire process is based on flat files, since we do a lot of UNIX command processing. I guess, we will stick to grep.

Kryt0n · Post by **Kryt0n** » Sun Mar 08, 2009 7:37 pm

Would have to agree there, particularly if you are trying to do LIKE conditions.

As for previous post, the idea was to break your input column (used in the lookup) in to rows, one for each word within the column. You can then lookup on each individually.

Kryt0n · Post by **Kryt0n** » Sun Mar 08, 2009 7:39 pm

Would have to agree there, particularly if you are trying to do LIKE conditions.

As for previous post, the idea was to break your input column (used in the lookup) in to rows, one for each word within the column. You can then lookup on each individually.

ray.wurlod · Post by **ray.wurlod** » Sun Mar 08, 2009 7:43 pm

Try the ODBC driver for text files. (It might be horribly slow, however.)

Another possibility might be to load your text file data into UniVerse tables and effect the complex join there. (Again this will be slow, because indexes can not assist LIKE joins.) However, you can create I-descriptors (virtual columns) such as the first two characters or the last two characters, and these can be indexed and would assist those kinds of search.

Code: Select all

WHERE SUBSTRING(TableA.column1 FROM 1 FOR 2) = TableB.LeftmostTwoCharacters

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Mon Mar 09, 2009 4:16 am

Code: Select all

sed -e 's/^/ /' -e 's/$/ /' sourceDataFile | egrep -f patternFile

Note - prefix and suffix each char in pattern file with space. Or make another sed to do it.

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Mon Mar 09, 2009 4:17 am

Code: Select all

sed -e 's/^/ /' -e 's/$/ /' sourceDataFile | egrep -f patternFile

Note - prefix and suffix each char in pattern file with space. Or make another sed to do it.

I assume you do not want the "replacement word".

just4geeks · Post by **just4geeks** » Mon Mar 09, 2009 7:34 am

Sainath.Srinivasan wrote:
Code: Select all
sed -e 's/^/ /' -e 's/$/ /' sourceDataFile | egrep -f patternFile

I am not sure I understand how can this help me. Can you please elaborate a bit? If I run the above command, a search for '^ab' will yield nothing.
How can I use the same pattern file for different searches, such as,

Code: Select all

grep '^ab'
grep ' ab '
grep 'ab$'

I need to keep the results of these different searches separate.

DSXchange

Lookup character strings

Lookup character strings

Re: Lookup character strings

Re: Lookup character strings

Re: Lookup character strings

Re: Lookup character strings