REMOVE DUPLICATE WORDS

Infosphere's Quality Product

Moderators: chulett, rschirm

ejazsalim
Premium Member
Posts: 51
Joined: Wed Apr 09, 2003 6:42 am
Location: VA, USA

REMOVE DUPLICATE WORDS

Post by ejazsalim »

How do I remove words that are repeated more than once in a string?

Example :
Input : JOHN DOE AND MARY DOE DBA JOHN INC

Output : JOHN DOE AND MARY DBA INC

Right now I am using DataStage to split all the words and then get the unique values. I was wondering if there was a simpler way to do this in QualityStage. I was able to use the pattern language, but it seems quite cumbersome.
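(For reference, the split-and-keep-unique logic being described is simple outside of PAL. This is just a Python sketch of the intended behavior, not a QualityStage solution:)

```python
def remove_duplicate_words(s):
    # Keep only the first occurrence of each word, preserving order.
    seen = set()
    out = []
    for word in s.split():
        if word not in seen:
            seen.add(word)
            out.append(word)
    return " ".join(out)

print(remove_duplicate_words("JOHN DOE AND MARY DOE DBA JOHN INC"))
# JOHN DOE AND MARY DBA INC
```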

Any suggestions ??
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can use PAL but, as you note, it will be quite cumbersome. You'll need to have the first PrimaryName already copied, then do a conditional pattern to handle the second PrimaryName. The condition, of course, will be an equality test; if it is satisfied, reclassify the second PrimaryName to the Null class (0). You will also have to handle the inequality case - probably move the entire second name to Additional Name Info. Other solutions almost certainly exist; that's how I'd approach it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

I guess you can use the reverse floating position specifier or the fixed position specifier.

Classify the tokens that you want to remove... let's say to class R.


The pattern below will make the right-most R token null:

*R | $
retype [1] 0


You can use the pattern below in a routine with a REPEAT clause to null all R tokens after the first one:

%2R
retype [1] 0
REPEAT
Last edited by JRodriguez on Tue Dec 22, 2009 3:59 pm, edited 1 time in total.
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
ejazsalim
Premium Member
Posts: 51
Joined: Wed Apr 09, 2003 6:42 am
Location: VA, USA

Post by ejazsalim »

Thanks Ray/Rod. I don't have much control over the source data, so I cannot CLASSIFY it. I think I will keep typing the PAL solution since there doesn't seem to be an easy way out.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

How does it currently classify it?

Input : JOHN DOE AND MARY DOE DBA JOHN INC

Output : JOHN DOE AND MARY DBA INC

This example also looks like 2 or 3 separate pieces of information that make sense when parsed correctly and not just chopped up.

John Doe and Mary Doe - people's names, obviously

The rest could be something like:
DBA John Inc - Company name

or

DBA - a position description
John Inc - company name

If that example is indicative of the data you have in there, I'd be asking more questions about how they're using the field to work out what you should do.

Maybe you need a prep ruleset first to split it up (If the DBA is a position description, then that would be a finite number of values that could be used to crack the whole thing wide open).
If there are 2 or 3 pieces of separate information there, then they need to be split and parsed individually.
Otherwise, anything you might do to chop this or that out will only corrupt your data.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

:oops: I completely missed the DBA!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ejazsalim
Premium Member
Posts: 51
Joined: Wed Apr 09, 2003 6:42 am
Location: VA, USA

Post by ejazsalim »

DBA = Doing Business As

There are a lot of other keywords like DBA embedded in the data.

What I am trying to do is remove all the first names and middle names and all the known keywords (like DBA/POD), get a list of the unknown words with duplicates eliminated, and then try to create a LOOSE match for exception handling (sorry if this is confusing).

Right now I am writing a pattern file to get to the unique words.

Follow up question

Is it possible to search for a variable in a string?

;INPUT -- JOHN DOE DBA JOE PIZZA DBA PIZZA TO GO

&
COPY "DBA" vKeyWord01

*&=vKeyWord01|** ; WHAT IS THE RIGHT SYNTAX ?

*&="DBA" | ** ; THIS WORKS BUT NOT WHAT I NEED

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

These kinds of words (like "TA" for "trading as") can be classified into a suitable class. That will make the parsing and pattern-action easier to implement. You may need more patterns to handle variants like "T/A".
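(The idea of the classification step can be sketched outside PAL. Here is a small Python stand-in for a classification file; the keyword table and class names are hypothetical, just to show how variants collapse to one class:)

```python
# Hypothetical keyword table standing in for a QualityStage classification file;
# each surface variant maps to a single class.
KEYWORD_CLASS = {
    "DBA": "DBA", "D/B/A": "DBA",
    "TA": "TA", "T/A": "TA",
    "POD": "POD",
}

def classify(token):
    # Return the class for known business keywords, else None (unclassified).
    return KEYWORD_CLASS.get(token.upper())

print([classify(t) for t in "JOHN DOE T/A PIZZA TO GO".split()])
# [None, None, 'TA', None, None, None]
```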
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

So to get this clear in my head: You want a unique list of unknown words and then (and this bit I understand less than the rest of it) use them to make some sort of key for matching?
Well, here goes nothing...

The unknown words bit is easy. Put everything you want removed into a classification file and then for every defined type, do the following:

;Here for F. Repeat for every type you defined.
0*F
RETYPE [1] 0

Deduping the unknown words within the PAT file will be a pain.
Off the top of my head (insert disclaimer here), something like this should work:

; Take note of the token you're trying to dedupe.
& | &
COPY [1] temp
RETYPE [1] 0

; Look for a second instance of it
; May also need to do this one a couple of times if the same unknown word shows up more than twice.
** | & [{} = temp] | [weirdDedupeKeyThingy = ""]
CONCAT [1] weirdDedupeKeyThingy
RETYPE [2] 0

** | & [{} = temp] | [weirdDedupeKeyThingy != ""]
CONCAT " " weirdDedupeKeyThingy
CONCAT [1] weirdDedupeKeyThingy
RETYPE [2] 0


Repeat the 2nd one a couple of times to form a block, then repeat the block, until you get what you want.
In the end you may have one or more tokens left:


& | $ [weirdDedupeKeyThingy = ""]
CONCAT [1] weirdDedupeKeyThingy
COPY weirdDedupeKeyThingy {WeirdThingyOutputField}
RETYPE [1] 0

& | $ [weirdDedupeKeyThingy != ""]
CONCAT " " weirdDedupeKeyThingy
CONCAT [1] weirdDedupeKeyThingy
COPY weirdDedupeKeyThingy {WeirdThingyOutputField}
RETYPE [1] 0
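(The PAL above is essentially building a space-separated key of unique unclassified tokens. As a sanity check on the logic, here is the same idea as a Python sketch; the CLASSIFIED set is an assumed stand-in for the classification file, and the key name is made up:)

```python
CLASSIFIED = {"DBA", "AND", "INC"}  # stand-in for tokens the classification file removes

def unknown_word_key(s):
    # Build a space-separated key of unique unclassified tokens, in order.
    seen = set()
    key_parts = []
    for token in s.split():
        if token in CLASSIFIED:
            continue           # classified tokens are "retyped to null"
        if token not in seen:  # only the first instance is CONCATed into the key
            seen.add(token)
            key_parts.append(token)
    return " ".join(key_parts)

print(unknown_word_key("JOHN DOE DBA JOE PIZZA DBA PIZZA TO GO"))
# JOHN DOE JOE PIZZA TO GO
```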



Still don't get the point though...
dsqspro
Premium Member
Posts: 20
Joined: Wed Apr 15, 2009 7:01 am

Post by dsqspro »

You will need a lot of data analysis before defining what kind of parsing rules are needed, because just fixing your current name pattern might not cover all the data cleansing required before standardization or matching.

Step one: identify known data patterns and unknown data patterns with example data.

Step two: show users, take their recommendations, and define high-level rules.

Step three: build QS jobs to properly place data into the respective buckets,

like Prefix, First Name, Middle Name, Last Name, Suffix, Additional Name.