split string field based on the occurrence of a string

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Marley777
Premium Member
Premium Member
Posts: 316
Joined: Tue Jan 27, 2004 3:26 pm

split string field based on the occurrence of a string

Post by Marley777 »

:?

Hi, trying to think of a way to look for the occurrence of something. If I find DBA I want to split everything after DBA into a seperate record and get rid of any spaces after DBA.

eample

I have one field that contains

BOB MARLEY DBA JAMAICA ISLAND TOURS

but need to turn it into this (2 records). Spaces are also gone.

BOB MARLEY
JAMAICA ISLAND TOURS

Thanks for reading this.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

What is the metadata of your input, is it one record, one column or multiple records?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Marley777
Premium Member
Premium Member
Posts: 316
Joined: Tue Jan 27, 2004 3:26 pm

Post by Marley777 »

Hi,

I have lots of records on one file that I'm reading into a transformer. Based on one column one record can become two records. We want to seperate names. If 2 records had DBA somewhere in the name field we would end up with 4 records.

BOB MARLEY DBA JAMAICA ISLAND TOURS
ZIGGY MARLEY DBA ZIGGY'S HERB SHOP

would become
BOB MARLEY
JAMAICA ISLAND TOURS
ZIGGY MARLEY
ZIGGY'S HERB SHOP

4 records. The name field has been split. The rest of the information on the records repeat, but the names will be different.

There can also be space that I need to get rid of
BOB MARLEY DBA <spaces> JAMAICA ISLAND TOURS
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Well, if I were you, I would do it in a shell script using a sed and replace DBA with leading and trailing space with a new line character.
In PX it becomes a little tricky. You need to EREPLACE the " DBA " with a new line character. For that you need to compile and link the PxEreplace() function which is not provided by Ascential (search for its source code here) and then run your stuff.
Id say do it in a server job or a shell script.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you must use parallel expressions in Transformer stage, use Index() to determine where "DBA" occurs, and substrings (possibly with Trim()) to extract what's either side of that.

Putting the Index() function in a stage variable is best, because you need that result more than once, but don't need to calculate it more than once.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Unfortunately for Thirunavukkarasu's idea, the Field() function only uses a single-character delimiter - if more than one is supplied then only the first character in the string is used. So it wouldn't work for
LEONARD COHEN DBA BIRD ON THE WIRE

Try

Code: Select all

sv0 --> Index(InLink.TheString, "DBA", 1)
sv1 --> InLink.TheString[1, sv0 - 1]
sv2 --> InLink.TheString[sv0 + 3, Len(InLink.TheString) - sv0 - 2]

Trim(sv1)
Trim(sv2)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Good solutions. But will only work if DBA occurs only once in a row. For multiple occurances of DBA, it wont work. Or the result set needs to be reprocessed again through the same logic until no literal "DBA" is found in a row.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ras1
Participant
Posts: 1
Joined: Wed Oct 17, 2007 12:10 pm

'DBA'

Post by ras1 »

DBA will only appear once. I'm trying the ideas you all have given me. I will let you know which one works. Thanks for reading and thanks for your help. Feel free to post any new ideas. Thanks!!!
balajisr
Charter Member
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Unfortunately for Thirunavukkarasu's idea, the Field() function only uses a single-character delimiter - if more than one is supplied then only the first character in the string is used.
Delimiter for a field function can be a string and it works. I am not sure why Ray advises against using this. :?
Marley777
Premium Member
Premium Member
Posts: 316
Joined: Tue Jan 27, 2004 3:26 pm

name parsing

Post by Marley777 »

Hi Ray, I looked at your example. Have a question. How can I expand on you logic to handle the following scenario.


I can have more than just 'DBA'. I can have 'C/O' or 'T/A' etc... about ten possibilities. Whenever I encounter one of these ten values the field must be parsed into two fields. I'm hoping I can use stage variables to interrogate the incoming data to see if it needs to be parsed or not. I will have to holding fields for the parsed name.

TOM BRADY SPORTS APPAREL - stays as is, no need to parse because there is no 'DBA' or 'C/O' etc...in the name.

CARSON PALMER INC C/O MIKE VERNON - name should split into two fields then move to output file as separate name records.
CARSON PALMER INC
MIKE VERNON

BOB MARLEY DBA JAMAICA ISLAND TOURS - name should split
BOB MARLEY
JAMAICA ISLAND TOURS

ZIGGY MARLEY T/A ZIGGY'S HERB SHOP - name should split
ZIGGY MARLEY
ZIGGY'S HERB SHOP
Marley777
Premium Member
Premium Member
Posts: 316
Joined: Tue Jan 27, 2004 3:26 pm

name parsing

Post by Marley777 »

Hi Ray, I looked at your example. Have a question. How can I expand on you logic to handle the following scenario.


I can have more than just 'DBA'. I can have 'C/O' or 'T/A' etc... about ten possibilities. Whenever I encounter one of these ten values the field must be parsed into two fields. I'm hoping I can use stage variables to interrogate the incoming data to see if it needs to be parsed or not. I will have to holding fields for the parsed name.

TOM BRADY SPORTS APPAREL - stays as is, no need to parse because there is no 'DBA' or 'C/O' etc...in the name.

CARSON PALMER INC C/O MIKE VERNON - name should split into two fields then move to output file as separate name records. I need to also drop C/O.
CARSON PALMER INC
MIKE VERNON

BOB MARLEY DBA JAMAICA ISLAND TOURS - name should split. Also drop DBA.
BOB MARLEY
JAMAICA ISLAND TOURS

ZIGGY MARLEY T/A ZIGGY'S HERB SHOP - name should split. Also drop T/A.
ZIGGY MARLEY
ZIGGY'S HERB SHOP

Thanks for your help.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Assuming that only one of these will occur in any one row, change the derivation of sv0 to

Code: Select all

sv0 --> Index(InLink.TheString, "C/O", 1) + Index(InLink.TheString, "DBA", 1) + Index(InLink.TheString, "T/A", 1) 
et cetera as required.
The Index() function returns 0 if the substring is not found.
The other two stage variable derivation expressions ought not to need to be changed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Marley777
Premium Member
Premium Member
Posts: 316
Joined: Tue Jan 27, 2004 3:26 pm

Post by Marley777 »

Hi, thanks for your help so far. Everything seems to be working. However there is a new piece to the puzzel. I have been given a spreadsheet that contains hundreds of values (BANK, AGENCY, AGENCY, ACADEMY). I have to know if the name I'm parsing cotains any of these values.

I know I can use
Index(clientData.FULL_NAME, "C/O DBA", 1)

But would have to code this hundreds of times.

my question is- How can I get the spread sheet containing hundreds of values into DS and how can I check the name filed on every record to see if the name field contains anything that is listed in the spread sheet?

Thnaks for your help.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You could load all the searchable strings into a key-only UV table (= hashed file) and use user-defined SQL

Code: Select all

SELECT key FROM table WHERE EVAL "Index(?, key, 1)" > 0;
It's ugly, it's slow, but it will work. I can't at the moment think of a better way.
To populate a UV table you need a server job, to use one in a lookup you need either to use a server job, or a server container in a parallel job. Read Chapter 2 of Parallel Job Developer's Guide for restrictions on using server containers in parallel jobs.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ejazsalim
Premium Member
Premium Member
Posts: 51
Joined: Wed Apr 09, 2003 6:42 am
Location: VA, USA

Re: split string field based on the occurrence of a string

Post by ejazsalim »

What was the solution to this ???

I had a similar problem and we used QualityStage to resolve it.
Post Reply