Business rules
I am a newbie to IA. I set up a project and analyzed a particular column I am interested in. I was hoping the tool would categorize the similarly spelled values for me, but I don't see an option to get that info.
What I am looking for is,
I have a column named "color" which has values like
BLACK
BLK
01 BLACK
..etc
I want these results to be categorized as probable similar values. How can I achieve this? Do I have to go the route of writing business rules?
How do I get started with learning the basics of writing business rules?
Thanks.
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
The job of IA is to analyse data, not modify it.
Since you've already done the column analysis, you could create a mapping table if the list is fairly static.
Alternatively, you could use QualityStage to create a ruleset and do some standardisation, but if you're just standardising a finite list of colours then this would be overkill. Spelling similarity wouldn't really work either, because the colour names and abbreviations are short.
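If the list really is static, the mapping-table approach can be as simple as a dictionary lookup. A minimal Python sketch, purely illustrative (the value list and the helper name are made up, not from IA or QualityStage):

```python
# Hypothetical mapping table built from the distinct values found
# during column analysis; unmapped values come back as None so they
# can be reviewed by hand.
COLOR_MAP = {
    "BLACK": "BLACK",
    "BLK": "BLACK",
    "BLCK": "BLACK",
    "01 BLACK": "BLACK",
}

def standardise_color(raw):
    """Return the canonical colour, or None when the value is unmapped."""
    return COLOR_MAP.get(raw.strip().upper())
```

In practice the same lookup would sit in a reference table joined to in a DataStage job, but the principle is identical.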
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
I concur entirely with Stuart. IA has no inherent concept of meaning; it gives one frequency distributions of values, formats, etc., but it is not the tool even for identifying similar spelling (though sorting by value in the analysis results may help somewhat in making these visible).
You (and/or a subject matter expert) then need to make a decision about which similarly-spelled items represent the same thing. (For example, how will you treat the value "BL"?) And then you construct some mechanism like a lookup table or a QualityStage classification or reference table or classification overrides to implement the same.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
OK, I understand.
I tried doing a basic report for "color" and simultaneously ran a query in the database to count the distinct values and their counts. The table I am analysing is huge. It has 18 million records and 1 million distinct values for color.
But when I run the "Frequency by Frequency" report under the "column frequency" option, the report says
Column Level Summary
Cardinality Count : 1,034,710
Null Count : 0
Actual Row Count : 17,937,920
Frequency Cut Off : 0
Total Rows Covered : 12,633,758
%Rows Covered : 70.4305
I do not understand why it looks at only 70% of the data and not 100%. Also, the report displays only 1000 distinct values and their counts, even though the summary says the cardinality is 1,034,710. Why is it not displaying the whole data?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Chances are that the other 30% are null, and/or that your analysis settings have limited the number of rows analyzed. Likewise your display options may limit display of results to 1000 rows in frequency reports. A quick way would be to sort by frequency - if you see values that weren't in the report when it was sorted the other way, then there is a filter in operation.
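Incidentally, the summary figures are internally consistent: %Rows Covered is just Total Rows Covered over Actual Row Count (plain arithmetic, nothing tool-specific):

```python
# Reproduce the report's %Rows Covered figure from its own numbers.
total_rows_covered = 12_633_758
actual_row_count = 17_937_920
pct_rows_covered = total_rows_covered / actual_row_count * 100  # ~70.4305
```

Since the summary shows Null Count : 0, the roughly 5.3 million uncovered rows look more like the effect of an analysis or display limit than of nulls, so the settings are worth checking first.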
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
LOL I'm still back on how this "colour" field has over 1 million distinct values...
You also do not want to create a frequency-by-frequency report to display the whole list. Even if it comes back after the day and a half it would take to put that sucker together, no one will ever look at a 30k-page PDF.
Instead, go into QualityStage and create an empty ruleset. You can put some basic colours in if you like, but that's not really important.
Then create a word investigate job using the ruleset and select both token and pattern reports. Tick all of the options, for displaying unknown terms etc (as most of them are currently unknown).
You can create some output fields, but they don't need to model it accurately yet.
Run the job and look first at the token results, then classify the most common ones, including abbreviations. BLK -> BLACK, BLCK -> BLACK, etc
Then look at the most common patterns and check stuff like having numbers in there and punctuation. With that many distinct values, there is bound to be a lot of combinations of information and junk, codes, punctuation: you name it.
- Do numbers or alphanum tokens mean anything? ie are they codes you can map to some reference data afterwards in DataStage?
- Can you make your life easier by removing punctuation to reduce the number of patterns you have to work with in a way that doesn't change the meaning of the data? (that last bit is paramount)
Start understanding the common patterns and work out how to handle them: which bits you need and which bits are irrelevant. It will change with each pattern.
Add some starter rules to split up the most common patterns.
Work out what you want to do if you can't get something useful out of the data. Is that ok, or not?
Run the job again and repeat a few times to narrow your focus as you go.
Obviously the amount of time you spend on this depends on how important accurate colour interpretation is to you.
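For a rough feel of what that investigate loop does, here is a hypothetical Python stand-in (not QualityStage itself; the table and function names are made up): tokenise the values, count token frequencies, and flag which tokens your hand-grown classification table doesn't cover yet.

```python
from collections import Counter
import re

# Hypothetical classification table, grown a little with each pass
# (mirroring the BLK -> BLACK, BLCK -> BLACK step described above).
CLASSIFIED = {"BLACK": "BLACK", "BLK": "BLACK", "BLCK": "BLACK"}

def token_report(values):
    """Return (token, count, already_classified) tuples, most frequent first."""
    counts = Counter(
        tok
        for v in values
        for tok in re.split(r"[\s/\-.]+", v.upper())
        if tok
    )
    return [(tok, n, tok in CLASSIFIED) for tok, n in counts.most_common()]
```

Each pass, you look at the unclassified tokens near the top of the list, decide what they mean, add them to the table, and rerun, which is exactly the narrowing-focus loop described above.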
-
- Participant
- Posts: 16
- Joined: Mon Jul 28, 2014 10:37 pm
Sorry for the late response. Please note that the original poster of this thread is the same person as me (dsusersaj = tempdsuser). My premium account is inactive and I am working to get it activated.
So I executed the job as you mentioned above. As I am very new to QualityStage, I am finding it hard to decipher the patterns. Some of the patterns I get look like this:
qsInvColumnName,qsInvPattern,qsInvSample,qsInvCount,qsInvPercent
FR_COLOR,,,3.86930700000000000E+06,2.1451508E+01
FR_COLOR,^, 0000,2.12979300000000000E+06,1.1807611E+01
FR_COLOR,^^,0 0,4.91400000000000000E+03,2.7243305E-02
FR_COLOR,^^^,009 233 04,2.98000000000000000E+02,1.6521174E-03
I am trying to figure out the pattern meanings from the user guide. Is there any other documentation I can refer to?
Thanks for your help
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
The ones you have put here are for an empty field, then 1, 2, and 3 numbers. The million-odd distinct values are making a bit more sense now. Do you get a lot of terms in the token report? It might be worth running it a couple of times to get just the unknown words, then the numbers too, etc.
The ^ is a Pattern Action Language symbol for a number.
You can find out about this in the Pattern Action Reference.
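To illustrate how a value is reduced to a pattern, here is a hypothetical sketch. Only the numeric code ^ is confirmed above; the other codes used here (+ for a plain word, @ for a mixed alphanumeric token) are my assumption of common Pattern Action Language codes, so verify them against the Pattern Action Reference.

```python
def to_pattern(value):
    """Reduce a value to a crude PAL-style pattern, one code per token.

    Codes assumed here: ^ = all digits (confirmed above),
    + = all letters, @ = mixed alphanumeric (assumptions).
    """
    codes = []
    for tok in value.split():
        if tok.isdigit():
            codes.append("^")
        elif tok.isalpha():
            codes.append("+")
        else:
            codes.append("@")
    return "".join(codes)
```

This reproduces the shapes in the report sample: "0000" gives ^, "009 233 04" gives ^^^, and an empty field gives an empty pattern.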
-
- Participant
- Posts: 16
- Joined: Mon Jul 28, 2014 10:37 pm
Thanks for your reply. This is starting to make sense now.
I am now trying to figure out what the Token report means.
Some of the values in that are as below
qsInvCount,qsInvWord,qsInvClassCode
3.00000000000000000E+00, ,
3.60000000000000000E+01,`,~
1.00000000000000000E+00,````````,~
2.00000000000000000E+00,`````````,~
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
A token is any group of characters that gets bundled together after the list of separator characters has been applied to the text.
QS standardisation works by first separating the text into tokens and then stripping the unwanted characters, applying the SepList and StripList.
When we read text, we use spaces and punctuation as separators. So does QS, but you get to control which characters are used.
Then you can strip out any junk characters (but be careful with what you consider "junk").
Then it classes each token. Are they words? Numbers? a combination of the 2? Are they words with specific meaning in this context? These are the classes.
QS has single character codes for each class. These are also in the Pattern Action Reference, somewhere in the first 5-10 pages.
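A crude Python mimic of that SepList/StripList mechanism (the character lists below are made-up examples, not the shipped defaults):

```python
SEPLIST = " ,.-/"   # characters that split the text into tokens (illustrative)
STRIPLIST = " ,."   # separators that are then discarded entirely (illustrative)

def tokenise(text, seplist=SEPLIST, striplist=STRIPLIST):
    """Split text on seplist characters; keep a separator as its own
    token unless it also appears in striplist."""
    tokens, current = [], ""
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)
                current = ""
            if ch not in striplist:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens
```

With these lists, "01 BLACK" becomes the tokens 01 and BLACK (the space is stripped), while "01-BLACK/RED" keeps the hyphen and slash as tokens of their own, which is why the strip list needs care: stripping too much can change the meaning of the data.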
-
- Participant
- Posts: 16
- Joined: Mon Jul 28, 2014 10:37 pm
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can not. Your reading is inaccurate. Data Rules can be shared between Information Analyzer and DataStage, in particular the Data Rules stage. Data Rule definitions can be created in either tool, and published into the repository to make them available to the other tool (and to Metadata Workbench).
That is not to say that Data Rules stage could not be incorporated into a QualityStage job flow, but the Data Rules do not contribute to QualityStage standardisation, mapping or survivorship.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 16
- Joined: Mon Jul 28, 2014 10:37 pm
Can you please make the premium content visible to me? Restoration of my premium account (dsusersaj) is still up in the air; no one has updated me on it.
ray.wurlod wrote: You can not. Your reading is inaccurate. Data Rules can be shared between Information Analyzer and DataStage, in particular the Data Rules stage. Data Rule definitions can be created in either tool ...
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
Ray was saying that published IA rule definitions can be used in the Data Rules stage in DataStage jobs.
You take a published rule definition, bind it to the relevant data available to the job, select your outputs and run the job.
You can also set up the results to go to the DQ Console if you have your DataStage project set up as a source in the DQ Console.
These have nothing whatsoever to do with QS rule sets.