Literal being added to the Data when trying the PREP ruleset

skadiam · Post by **skadiam** » Tue Feb 21, 2012 3:50 am

Hi, I was trying to enter literals like ZQNAMEZQ or ZQADDRZQ before the fields when using the PREP rulesets. After running the job, the literal is added to the data in some cases and works fine in others. I would like to know whether there is a specific order in which the literals must be defined for the fields. Can anyone please provide some details on this?

stuartjvnorton · Post by **stuartjvnorton** » Tue Feb 21, 2012 10:08 pm

You need to add it before each of the input columns you are putting through the PREP set.
IIRC, you also have 6 internal fields inside the PREP set (i.e. you can only add 6 of these directives), so if you have more input columns than that, you'll need to work out how best to fit all of your input columns into the 6.

skadiam · Post by **skadiam** » Wed Feb 22, 2012 5:30 am

Yes, that is fine. Let me give an example for US PREP.
I have 11 fields: AddLine1, AddLine2, AddLine3, City, State, Zipcode, UnparsedAdd1, UnparsedAdd2, UnparsedAdd3, UnparsedAdd4, UnparsedAdd5

The logic is that UnparsedAdd1-5 will not contain data when there is data in AddLine1-3 and City, State, Zipcode.

I have set up the literals as below:

ZQNAMEZQ before AddLine1 and UnparsedAdd1,
ZQNAMEZQ before AddLine2,
ZQADDRZQ before AddLine3,
ZQAREAZQ before UnparsedAdd2,
ZQAREAZQ before UnparsedAdd3 and City, State, Zipcode,
ZQAREAZQ before UnparsedAdd4,
ZQAREAZQ before UnparsedAdd5

In the output for some values in the Name Domain, I see 'ZQ' appearing in between the AddLine1 and AddLine2 data.

E.g. if AddLine1 has 'ABC' and AddLine2 has 'DEF', then in the NameDomain I see the data as ABC ZQ DEF

Is there any specific reason for this, and how should I correct it?

Also, is there any order that we need to follow when giving the literals, like ZQNAMEZQ before ZQADDRZQ and ZQADDRZQ before ZQAREAZQ? And how does this literal stuff actually work?

rjdickson · Post by **rjdickson** » Wed Feb 22, 2012 5:50 pm

Take a look at http://publib.boulder.ibm.com/infocente ... _file.html for a reasonable explanation.

You can only have up to 6 delimiters, and you have 7. Can you try combining UnparsedAdd4 and UnparsedAdd5 into the sixth one?
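A quick way to sanity-check this before running the job is to count the literals in a sample of the concatenated input. The snippet below is only a rough sketch in Python, not anything from the product: the function names are hypothetical, the pattern covers only the three literals mentioned in this thread, and the limit of 6 is the figure quoted here rather than a verified constant.

```python
import re

# Only the literals mentioned in this thread; the real rule set may accept others.
LITERAL = re.compile(r"ZQ(?:NAME|ADDR|AREA)ZQ")
MAX_LITERALS = 6  # limit quoted in this thread, not verified against the product


def count_literals(text):
    """Count the ZQ...ZQ literals in a concatenated input string."""
    return len(LITERAL.findall(text))


def fits_prep_limit(text, limit=MAX_LITERALS):
    """Return True if the string uses no more literals than the assumed limit."""
    return count_literals(text) <= limit


# Made-up sample laid out like the seven-literal setup described above.
seven = ("ZQNAMEZQ ABC ZQNAMEZQ DEF ZQADDRZQ 1 MAIN ST ZQAREAZQ A "
         "ZQAREAZQ B ZQAREAZQ C ZQAREAZQ D")
print(count_literals(seven), fits_prep_limit(seven))
```

If the check fails, some of the columns need to share a literal.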

I hope that helps!

stuartjvnorton · Post by **stuartjvnorton** » Wed Feb 22, 2012 6:00 pm

In a nutshell, it uses the different ZQ delimiters to tell it what context it should use to look at the data that follows.

Haven't looked at the code in a while, but it probably starts at the back and looks for the delimiter. When it finds it, it takes the text that follows and puts it in a variable.
There are 2 variables for each internal "field": 1 for the text and 1 for the delimiter.
If it gets to 5 delimiters, it probably then takes the rest of the data and stores it as the field text. The fact you have a ZQ in the middle of the first one is consistent with that: you have 7 delimiters listed, and the standardised form of any of the valid delimiters is "ZQ".

Once it has the fields, it goes through each one and gets the word pattern based on the hint in the delimiter, then puts the bits in the relevant domain buckets.
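To make that guess concrete, here is a small Python sketch of the behaviour as I've described it. It is not the PREP rule set's actual code (that is written in the rule set's own pattern-action language); the function name, the regex and the collapse-to-"ZQ" step are all assumptions made purely to illustrate why a seventh delimiter could show up as a stray "ZQ".

```python
import re

# Only the literals mentioned in this thread; assumed, not taken from the rule set.
LITERAL = re.compile(r"ZQ(NAME|ADDR|AREA)ZQ")


def split_fields(text, max_fields=6):
    """Split text into (hint, text) pairs, scanning delimiters from the back."""
    matches = list(LITERAL.finditer(text))
    fields = []
    end = len(text)
    # Peel off up to max_fields - 1 fields, starting with the last delimiter.
    for m in reversed(matches):
        if len(fields) == max_fields - 1:
            break
        fields.append((m.group(1), text[m.end():end].strip()))
        end = m.start()
    # Whatever is left in front becomes the first field's text.
    rest = text[:end].strip()
    if rest:
        first = LITERAL.match(rest)
        hint = first.group(1) if first else None
        body = rest[first.end():] if first else rest
        # Assumption: any extra literal left inside is reduced to a bare "ZQ".
        body = " ".join(LITERAL.sub("ZQ", body).split())
        fields.append((hint, body))
    fields.reverse()
    return fields


sample = ("ZQNAMEZQ ABC ZQNAMEZQ DEF ZQADDRZQ 1 MAIN ST ZQAREAZQ A "
          "ZQAREAZQ B ZQAREAZQ C ZQAREAZQ D")
for hint, value in split_fields(sample):
    print(hint, "|", value)
```

With seven literals in the sample, the first field in this sketch comes out as "ABC ZQ DEF", which is the same shape as the value reported in the Name domain.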

Looking at your input, you might be able to bunch them up a little more (seeing as you know the rules for when data is present in which field):
ZQNAMEZQ AddLine1 AddLine2 UnparsedAdd1
ZQADDRZQ AddLine3 UnparsedAdd2
ZQAREAZQ UnparsedAdd3 UnparsedAdd4 UnparsedAdd5 City State Zipcode
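
If it helps to see what string that grouping produces, here is a rough Python sketch of the concatenation. In the job itself this is configured in the Standardize stage rather than coded, so the function name, the dictionary layout and the sample record below are made up for illustration; only the column names and literals come from this thread.

```python
# Grouping suggested above: one literal in front of each bunch of columns.
GROUPS = [
    ("ZQNAMEZQ", ["AddLine1", "AddLine2", "UnparsedAdd1"]),
    ("ZQADDRZQ", ["AddLine3", "UnparsedAdd2"]),
    ("ZQAREAZQ", ["UnparsedAdd3", "UnparsedAdd4", "UnparsedAdd5",
                  "City", "State", "Zipcode"]),
]


def build_prep_input(record):
    """Concatenate a record's columns with one literal per group."""
    parts = []
    for literal, columns in GROUPS:
        # Skip empty or missing columns so no double spaces appear.
        values = [(record.get(c) or "").strip() for c in columns]
        values = [v for v in values if v]
        # The literal is emitted even when the whole group is empty.
        parts.append(" ".join([literal] + values))
    return " ".join(parts)


# Hypothetical record where the parsed columns are populated.
record = {
    "AddLine1": "ABC", "AddLine2": "DEF", "AddLine3": "1 MAIN ST",
    "City": "BOSTON", "State": "MA", "Zipcode": "02101",
}
print(build_prep_input(record))
```

That keeps the input at three literals, well inside the six available.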

The thing with bunching or splitting them is that if you have profiled the data and know what's in there, you can make intelligent decisions about how much you can or should bunch it.
Split it up too much and individual fields may not have enough context to make good decisions (or you run out of fields).
Bunch it up too much and you have more complex patterns that you have to code for or you start getting mixed domain inputs that (from my limited experience with the PREP set) don't get split up as well as you'd like.

Hope this helps.