Page 1 of 1

split string field based on the occurrence of a string

Posted: Tue Jan 29, 2008 2:18 pm
by Marley777
:?

Hi, trying to think of a way to look for the occurrence of something. If I find DBA I want to split everything after DBA into a seperate record and get rid of any spaces after DBA.

eample

I have one field that contains

BOB MARLEY DBA JAMAICA ISLAND TOURS

but need to turn it into this (2 records). Spaces are also gone.

BOB MARLEY
JAMAICA ISLAND TOURS

Thanks for reading this.

Posted: Tue Jan 29, 2008 2:25 pm
by DSguru2B
What is the metadata of your input, is it one record, one column or multiple records?

Posted: Tue Jan 29, 2008 2:36 pm
by Marley777
Hi,

I have lots of records on one file that I'm reading into a transformer. Based on one column one record can become two records. We want to seperate names. If 2 records had DBA somewhere in the name field we would end up with 4 records.

BOB MARLEY DBA JAMAICA ISLAND TOURS
ZIGGY MARLEY DBA ZIGGY'S HERB SHOP

would become
BOB MARLEY
JAMAICA ISLAND TOURS
ZIGGY MARLEY
ZIGGY'S HERB SHOP

4 records. The name field has been split. The rest of the information on the records repeat, but the names will be different.

There can also be space that I need to get rid of
BOB MARLEY DBA <spaces> JAMAICA ISLAND TOURS

Posted: Tue Jan 29, 2008 2:38 pm
by DSguru2B
Well, if I were you, I would do it in a shell script using a sed and replace DBA with leading and trailing space with a new line character.
In PX it becomes a little tricky. You need to EREPLACE the " DBA " with a new line character. For that you need to compile and link the PxEreplace() function which is not provided by Ascential (search for its source code here) and then run your stuff.
Id say do it in a server job or a shell script.

Posted: Tue Jan 29, 2008 9:42 pm
by ray.wurlod
If you must use parallel expressions in Transformer stage, use Index() to determine where "DBA" occurs, and substrings (possibly with Trim()) to extract what's either side of that.

Putting the Index() function in a stage variable is best, because you need that result more than once, but don't need to calculate it more than once.

Posted: Wed Jan 30, 2008 1:06 am
by ray.wurlod
Unfortunately for Thirunavukkarasu's idea, the Field() function only uses a single-character delimiter - if more than one is supplied then only the first character in the string is used. So it wouldn't work for
LEONARD COHEN DBA BIRD ON THE WIRE

Try

Code: Select all

sv0 --> Index(InLink.TheString, "DBA", 1)
sv1 --> InLink.TheString[1, sv0 - 1]
sv2 --> InLink.TheString[sv0 + 3, Len(InLink.TheString) - sv0 - 2]

Trim(sv1)
Trim(sv2)

Posted: Wed Jan 30, 2008 2:00 pm
by DSguru2B
Good solutions. But will only work if DBA occurs only once in a row. For multiple occurances of DBA, it wont work. Or the result set needs to be reprocessed again through the same logic until no literal "DBA" is found in a row.

'DBA'

Posted: Wed Jan 30, 2008 2:19 pm
by ras1
DBA will only appear once. I'm trying the ideas you all have given me. I will let you know which one works. Thanks for reading and thanks for your help. Feel free to post any new ideas. Thanks!!!

Posted: Wed Jan 30, 2008 11:41 pm
by balajisr
Unfortunately for Thirunavukkarasu's idea, the Field() function only uses a single-character delimiter - if more than one is supplied then only the first character in the string is used.
Delimiter for a field function can be a string and it works. I am not sure why Ray advises against using this. :?

name parsing

Posted: Thu Jan 31, 2008 11:49 am
by Marley777
Hi Ray, I looked at your example. Have a question. How can I expand on you logic to handle the following scenario.


I can have more than just 'DBA'. I can have 'C/O' or 'T/A' etc... about ten possibilities. Whenever I encounter one of these ten values the field must be parsed into two fields. I'm hoping I can use stage variables to interrogate the incoming data to see if it needs to be parsed or not. I will have to holding fields for the parsed name.

TOM BRADY SPORTS APPAREL - stays as is, no need to parse because there is no 'DBA' or 'C/O' etc...in the name.

CARSON PALMER INC C/O MIKE VERNON - name should split into two fields then move to output file as separate name records.
CARSON PALMER INC
MIKE VERNON

BOB MARLEY DBA JAMAICA ISLAND TOURS - name should split
BOB MARLEY
JAMAICA ISLAND TOURS

ZIGGY MARLEY T/A ZIGGY'S HERB SHOP - name should split
ZIGGY MARLEY
ZIGGY'S HERB SHOP

name parsing

Posted: Thu Jan 31, 2008 11:51 am
by Marley777
Hi Ray, I looked at your example. Have a question. How can I expand on you logic to handle the following scenario.


I can have more than just 'DBA'. I can have 'C/O' or 'T/A' etc... about ten possibilities. Whenever I encounter one of these ten values the field must be parsed into two fields. I'm hoping I can use stage variables to interrogate the incoming data to see if it needs to be parsed or not. I will have to holding fields for the parsed name.

TOM BRADY SPORTS APPAREL - stays as is, no need to parse because there is no 'DBA' or 'C/O' etc...in the name.

CARSON PALMER INC C/O MIKE VERNON - name should split into two fields then move to output file as separate name records. I need to also drop C/O.
CARSON PALMER INC
MIKE VERNON

BOB MARLEY DBA JAMAICA ISLAND TOURS - name should split. Also drop DBA.
BOB MARLEY
JAMAICA ISLAND TOURS

ZIGGY MARLEY T/A ZIGGY'S HERB SHOP - name should split. Also drop T/A.
ZIGGY MARLEY
ZIGGY'S HERB SHOP

Thanks for your help.

Posted: Thu Jan 31, 2008 4:32 pm
by ray.wurlod
Assuming that only one of these will occur in any one row, change the derivation of sv0 to

Code: Select all

sv0 --> Index(InLink.TheString, "C/O", 1) + Index(InLink.TheString, "DBA", 1) + Index(InLink.TheString, "T/A", 1) 
et cetera as required.
The Index() function returns 0 if the substring is not found.
The other two stage variable derivation expressions ought not to need to be changed.

Posted: Fri Feb 22, 2008 1:10 pm
by Marley777
Hi, thanks for your help so far. Everything seems to be working. However there is a new piece to the puzzel. I have been given a spreadsheet that contains hundreds of values (BANK, AGENCY, AGENCY, ACADEMY). I have to know if the name I'm parsing cotains any of these values.

I know I can use
Index(clientData.FULL_NAME, "C/O DBA", 1)

But would have to code this hundreds of times.

my question is- How can I get the spread sheet containing hundreds of values into DS and how can I check the name filed on every record to see if the name field contains anything that is listed in the spread sheet?

Thnaks for your help.

Posted: Fri Feb 22, 2008 5:30 pm
by ray.wurlod
You could load all the searchable strings into a key-only UV table (= hashed file) and use user-defined SQL

Code: Select all

SELECT key FROM table WHERE EVAL "Index(?, key, 1)" > 0;
It's ugly, it's slow, but it will work. I can't at the moment think of a better way.
To populate a UV table you need a server job, to use one in a lookup you need either to use a server job, or a server container in a parallel job. Read Chapter 2 of Parallel Job Developer's Guide for restrictions on using server containers in parallel jobs.

Re: split string field based on the occurrence of a string

Posted: Tue Mar 18, 2008 8:50 pm
by ejazsalim
What was the solution to this ???

I had a similar problem and we used QualityStage to resolve it.