Business rules
I am a newbie to IA. I set up a project and analyzed a particular column I am interested in. I was hoping the tool would categorize the similarly spelled values for me, but I don't see an option to get that info.
What I am looking for is,
I have a column named "color" which has values like
BLACK
BLK
01 BLACK
..etc
I want these results to be categorized as probable similar values. How can I achieve this? Do I have to go the route of writing business rules?
How do I get started with learning the basics of writing business rules?
Thanks.
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
The job of IA is to analyse data, not modify it.
Since you've already done the column analysis, you could create a mapping table if the list is fairly static.
Alternatively, you could use QualityStage to create a ruleset and do some standardisation, but if you're just standardising a finite list of colours then this would be overkill. Spelling similarity wouldn't really work either, because the colour names and abbreviations are short.
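If the list really is static, the mapping-table approach can be as simple as a dictionary lookup. A minimal Python sketch, purely illustrative (the value list and the helper name are made up, not from IA or QualityStage):

```python
# Hypothetical mapping table built from the distinct values found
# during column analysis; unmapped values come back as None so they
# can be reviewed by hand.
COLOR_MAP = {
    "BLACK": "BLACK",
    "BLK": "BLACK",
    "BLCK": "BLACK",
    "01 BLACK": "BLACK",
}

def standardise_color(raw):
    """Return the canonical colour, or None when the value is unmapped."""
    return COLOR_MAP.get(raw.strip().upper())
```

In practice the same lookup would sit in a reference table joined to in a DataStage job, but the principle is identical.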
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
I concur entirely with Stuart. IA has no inherent concept of meaning; it gives one frequency distributions of values, formats, etc., but it is not the tool even for identifying similar spelling (though sorting by value in the analysis results may help somewhat in making these visible).
You (and/or a subject matter expert) then need to make a decision about which similarly-spelled items represent the same thing. (For example, how will you treat the value "BL"?) And then you construct some mechanism like a lookup table or a QualityStage classification or reference table or classification overrides to implement the same.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
OK, I understand.
I tried doing a basic report for "color" and simultaneously ran a query in the database to count the distinct values and their counts. The table I am analysing is huge. It has 18 million records and 1 million distinct values for color.
But when I run the "Frequency by Frequency" report under the "column frequency" option, the report says
Column Level Summary
Cardinality Count : 1,034,710
Null Count : 0
Actual Row Count : 17,937,920
Frequency Cut Off : 0
Total Rows Covered : 12,633,758
%Rows Covered : 70.4305
I do not understand why it looks at only 70% of the data and not 100%. Also, the report displays only 1000 distinct values and their counts, even though the summary says the cardinality is 1,034,710. Why is it not displaying the whole data?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Chances are that the other 30% are null, and/or that your analysis settings have limited the number of rows analyzed. Likewise your display options may limit display of results to 1000 rows in frequency reports. A quick way would be to sort by frequency - if you see values that weren't in the report when it was sorted the other way, then there is a filter in operation.
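Incidentally, the summary figures are internally consistent: %Rows Covered is just Total Rows Covered over Actual Row Count (plain arithmetic, nothing tool-specific):

```python
# Reproduce the report's %Rows Covered figure from its own numbers.
total_rows_covered = 12_633_758
actual_row_count = 17_937_920
pct_rows_covered = total_rows_covered / actual_row_count * 100  # ~70.4305
```

Since the summary shows Null Count : 0, the roughly 5.3 million uncovered rows look more like the effect of an analysis or display limit than of nulls, so the settings are worth checking first.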
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
LOL I'm still back on how this "colour" field has over 1 million distinct values...
You also do not want to create a frequency-by-frequency report to display the whole list. Even if it comes back after the day and a half it would take to put that sucker together, no one will ever look at a 30k-page PDF.
Instead, go into QualityStage and create an empty ruleset. You can put some basic colours in if you like, but that's not really important.
Then create a word investigate job using the ruleset and select both token and pattern reports. Tick all of the options, for displaying unknown terms etc (as most of them are currently unknown).
You can create some output fields, but they don't need to model it accurately yet.
Run the job and look first at the token results, then classify the most common ones, including abbreviations. BLK -> BLACK, BLCK -> BLACK, etc
Then look at the most common patterns and check stuff like having numbers in there and punctuation. With that many distinct values, there is bound to be a lot of combinations of information and junk, codes, punctuation: you name it.
- Do numbers or alphanum tokens mean anything? ie are they codes you can map to some reference data afterwards in DataStage?
- Can you make your life easier by removing punctuation to reduce the number of patterns you have to work with in a way that doesn't change the meaning of the data? (that last bit is paramount)
Start understanding the common patterns and work out how to handle them: which bits you need and which bits are irrelevant. It will change with each pattern.
Add some starter rules to split up the most common patterns.
Work out what you want to do if you can't get something useful out of the data. Is that ok, or not?
Run the job again and repeat a few times to narrow your focus as you go.
Obviously the amount of time you spend on this depends on how important accurate colour interpretation is to you.
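For a rough feel of what that investigate loop does, here is a hypothetical Python stand-in (not QualityStage itself; the table and function names are made up): tokenise the values, count token frequencies, and flag which tokens your hand-grown classification table doesn't cover yet.

```python
from collections import Counter
import re

# Hypothetical classification table, grown a little with each pass
# (mirroring the BLK -> BLACK, BLCK -> BLACK step described above).
CLASSIFIED = {"BLACK": "BLACK", "BLK": "BLACK", "BLCK": "BLACK"}

def token_report(values):
    """Return (token, count, already_classified) tuples, most frequent first."""
    counts = Counter(
        tok
        for v in values
        for tok in re.split(r"[\s/\-.]+", v.upper())
        if tok
    )
    return [(tok, n, tok in CLASSIFIED) for tok, n in counts.most_common()]
```

Each pass, you look at the unclassified tokens near the top of the list, decide what they mean, add them to the table, and rerun, which is exactly the narrowing-focus loop described above.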
-
- Participant
- Posts: 16
- Joined: Mon Jul 28, 2014 10:37 pm
Sorry for the late response. Please note that the original poster of this thread is the same person as me (dsusersaj = tempdsuser). My premium account is inactive and I am working to get it activated.
So I executed the job as you mentioned above. As I am very new to QualityStage, I am finding it hard to decipher the patterns. Some of the patterns I get look like this:
qsInvColumnName,qsInvPattern,qsInvSample,qsInvCount,qsInvPercent
FR_COLOR,,,3.86930700000000000E+06,2.1451508E+01
FR_COLOR,^, 0000,2.12979300000000000E+06,1.1807611E+01
FR_COLOR,^^,0 0,4.91400000000000000E+03,2.7243305E-02
FR_COLOR,^^^,009 233 04,2.98000000000000000E+02,1.6521174E-03
I am trying to figure out the pattern meanings from the user guide. Is there any other documentation I can refer to?
Thanks for your help
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
The ones you have put here are for an empty field, then 1, 2, and 3 numbers. The million-odd distinct values are making a bit more sense now. Do you get a lot of terms in the token report? It might be worth running it a couple of times to get just the unknown words, then the numbers too, etc.
The ^ is a Pattern Action Language symbol for a number.
You can find out about this in the Pattern Action Reference.
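To illustrate how a value is reduced to a pattern, here is a hypothetical sketch. Only the numeric code ^ is confirmed above; the other codes used here (+ for a plain word, @ for a mixed alphanumeric token) are my assumption of common Pattern Action Language codes, so verify them against the Pattern Action Reference.

```python
def to_pattern(value):
    """Reduce a value to a crude PAL-style pattern, one code per token.

    Codes assumed here: ^ = all digits (confirmed above),
    + = all letters, @ = mixed alphanumeric (assumptions).
    """
    codes = []
    for tok in value.split():
        if tok.isdigit():
            codes.append("^")
        elif tok.isalpha():
            codes.append("+")
        else:
            codes.append("@")
    return "".join(codes)
```

This reproduces the shapes in the report sample: "0000" gives ^, "009 233 04" gives ^^^, and an empty field gives an empty pattern.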
-
- Participant
- Posts: 16
- Joined: Mon Jul 28, 2014 10:37 pm
Thanks for your reply. This is starting to make sense now.
I am now trying to figure out what the Token report means.
Some of the values in that are as below
qsInvCount,qsInvWord,qsInvClassCode
3.00000000000000000E+00, ,
3.60000000000000000E+01,`,~
1.00000000000000000E+00,````````,~
2.00000000000000000E+00,`````````,~
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
A token is any group of characters that gets bundled together after the list of separator characters has been applied to the text.
QS standardisation works by first separating the text into tokens and then stripping the unwanted characters, applying the SepList and StripList.
When we read text, we use spaces and punctuation as separators. So does QS, but you get to control which characters are used.
Then you can strip out any junk characters (but be careful with what you consider "junk").
Then it classes each token. Are they words? Numbers? a combination of the 2? Are they words with specific meaning in this context? These are the classes.
QS has single character codes for each class. These are also in the Pattern Action Reference, somewhere in the first 5-10 pages.
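A crude Python mimic of that SepList/StripList mechanism (the character lists below are made-up examples, not the shipped defaults):

```python
SEPLIST = " ,.-/"   # characters that split the text into tokens (illustrative)
STRIPLIST = " ,."   # separators that are then discarded entirely (illustrative)

def tokenise(text, seplist=SEPLIST, striplist=STRIPLIST):
    """Split text on seplist characters; keep a separator as its own
    token unless it also appears in striplist."""
    tokens, current = [], ""
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)
                current = ""
            if ch not in striplist:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens
```

With these lists, "01 BLACK" becomes the tokens 01 and BLACK (the space is stripped), while "01-BLACK/RED" keeps the hyphen and slash as tokens of their own, which is why the strip list needs care: stripping too much can change the meaning of the data.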
-
- Participant
- Posts: 16
- Joined: Mon Jul 28, 2014 10:37 pm
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can not. Your reading is inaccurate. Data Rules can be shared between Information Analyzer and DataStage, in particular the Data Rules stage. Data Rule definitions can be created in either tool, and published into the repository to make them available to the other tool (and to Metadata Workbench).
That is not to say that Data Rules stage could not be incorporated into a QualityStage job flow, but the Data Rules do not contribute to QualityStage standardisation, mapping or survivorship.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 16
- Joined: Mon Jul 28, 2014 10:37 pm
Can you please make the premium content visible to me? Restoration of my premium account (dsusersaj) is still up in the air; no one has updated me on it.
ray.wurlod wrote: You can not. Your reading is inaccurate. Data Rules can be shared between Information Analyzer and DataStage, in particular the Data Rules stage. Data Rule definitions can be created in either tool ...
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
Ray was saying that published IA rule definitions can be used in the Data Rules stage in DataStage jobs.
You take a published rule definition, bind it to the relevant data available to the job, select your outputs and run the job.
You can also set up the results to go to the DQ Console if you have your DataStage project set up as a source in the DQ Console.
These have nothing whatsoever to do with QS rule sets.