I am new to DataStage. I know the definition of data cleansing, but how do we perform it in real-world DataStage projects? Can anybody explain with some examples?
Data cleansing
DataStage itself does not have a data-cleansing facility; for that, take a look at QualityStage.
But in DataStage you can do "data quality" work, i.e. analyse your data to produce reports (uniqueness, counts, etc.), apply business rules (in a Transformer, based on routines or reference lookups), and eventually take decisions based on those results.
Anyway, it is just data transformation, like the T in ETL.
I can give you a simple definition of data cleansing: it is like querying your DBMS with a SELECT of only the columns you need, so that you don't get unwanted data.
Data cleansing happens at various levels. The smallest grain is the row, which can be achieved with a simple Transformer job.
The highest level is an aggregated grain, which is possible using an Aggregator.
Even in the source or target stage you can specify just the columns you are interested in.
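To make the idea concrete, here is a rough, tool-agnostic sketch in plain Python (not DataStage) of the three grains described above; the field names are invented for illustration.

```python
# Hypothetical input rows; "junk" stands in for an unwanted column.
rows = [
    {"id": 1, "amount": 100, "region": "EU", "junk": "x"},
    {"id": 2, "amount": -5,  "region": "EU", "junk": "y"},   # fails a row rule
    {"id": 3, "amount": 40,  "region": "US", "junk": "z"},
]

# Row grain (what a simple Transformer job does): drop rows failing a rule.
valid = [r for r in rows if r["amount"] >= 0]

# Column selection (what the source/target stage does): keep only the
# columns you are interested in.
selected = [{"id": r["id"], "amount": r["amount"], "region": r["region"]}
            for r in valid]

# Aggregated grain (what an Aggregator stage does): totals per region.
totals = {}
for r in selected:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]

print(totals)  # {'EU': 100, 'US': 40}
```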
I would appreciate someone commenting on this.
Kommu
Most data cleansing in DataStage is done through a transformer or a QualityStage plugin.
In a Transformer, cleansing is performed through BASIC code, constraints or lookups.
- BASIC code lets you convert the data, for example trimming blanks, checking a text field for a valid date and converting it to internal date format, removing invalid characters etc. Almost any cleansing business rule can be written in BASIC code.
- Constraints let you filter invalid data and send it to a reject file. If BASIC cleansing code finds a problem a constraint can turn it into a reject row.
- Lookups can be used to validate codes or foreign key relationships.
- QualityStage is the best option for text parsing such as breaking down a single address string, or breaking names into first name and surname. There are a huge number of cleansing functions in this product. By using the DataStage plugin you can retrieve clean or rejected records out of the QualityStage job back into the DataStage job for further processing.
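As an illustration only, here is how those three Transformer mechanisms (derivations, constraints feeding a reject link, and reference lookups) might look if sketched in plain Python rather than DataStage BASIC; the record layout and the reference set are invented.

```python
from datetime import datetime

# Stand-in for a reference/lookup table of valid codes.
VALID_COUNTRY_CODES = {"AU", "US", "GB"}

def cleanse(record):
    """Return (clean_record, None) or (None, reason) for a reject link."""
    # Derivation-style conversion: trim blanks.
    name = record["name"].strip()
    # Validate a text field as a date and convert it to an internal format.
    try:
        dob = datetime.strptime(record["dob"], "%Y-%m-%d").date()
    except ValueError:
        return None, "invalid date"
    # Lookup-style validation of a code (a constraint would reject the row).
    if record["country"] not in VALID_COUNTRY_CODES:
        return None, "unknown country code"
    return {"name": name, "dob": dob, "country": record["country"]}, None

clean, reason = cleanse({"name": "  Ann ", "dob": "1990-02-01", "country": "AU"})
bad, why = cleanse({"name": "Bob", "dob": "not-a-date", "country": "AU"})
```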
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
To keep it short, DataStage 'transforms' (i.e. changes) the data as per your need. So if you already know about a data quality problem, you can leave it, rectify it or escalate it using DataStage.
But it is not a tool to 'identify' unknown quality problems. To do that, you need some other tool, such as QualityStage, or plain SQL.
In the presence of low-quality data, DataStage will either produce garbage or fail, depending on how the jobs are designed. Hence you may be better off performing simple quality checks in the source system periodically, to analyse possible quality defects and confirm that the data meets your quality standards.
That is also why taking periodic backups is important, in case you miss something.
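As an illustration of the kind of "plain SQL" quality checks mentioned above, here is a small sketch run against an in-memory SQLite table; the table and columns are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, email TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "a@x.com"), (2, None), (2, "b@x.com")])

# Check 1: rows with a missing mandatory value.
null_emails = con.execute(
    "SELECT COUNT(*) FROM customer WHERE email IS NULL").fetchone()[0]

# Check 2: duplicate keys.
dupes = con.execute(
    "SELECT id, COUNT(*) FROM customer GROUP BY id HAVING COUNT(*) > 1"
).fetchall()

print(null_emails, dupes)  # 1 [(2, 2)]
```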
Even better for identifying data problems are ProfileStage (formerly known as MetaRecon) or AuditStage (formerly known as Quality Manager).
This is exactly what those tools do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hi,
Well, ProfileStage only gives you statistics about the actual data present in your data sources, with no options to change it.
QualityStage, on the other hand, lets you identify look-alikes through Soundex and pattern matching of data that is not equal but might represent the same thing, and standardise them; in the process you might even deduplicate multiple entries of the same actual data, even if they are not written in the same manner.
They say QualityStage achieves around 75-80% standardisation (if my memory serves me) out of the box, or after a short development effort (around two weeks), depending on whether you need localisation and how extensive that localisation work will be if it is not already implemented.
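As a toy illustration of the Soundex idea mentioned above (this is the classic Soundex algorithm sketched in Python, not QualityStage's actual matching engine), grouping names by their code surfaces candidates for deduplication:

```python
def soundex(name):
    # Classic American Soundex letter-to-digit groups.
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}
    def code(ch):
        return next((d for letters, d in codes.items() if ch in letters), "")
    name = name.lower()
    out = name[0].upper()
    last = code(name[0])
    for ch in name[1:]:
        if ch in "hw":          # h/w do not break a run of the same code
            continue
        d = code(ch)
        if d and d != last:
            out += d
        last = d
    return (out + "000")[:4]    # pad/truncate to letter + 3 digits

# Names spelled differently but sounding alike get the same code,
# so each group is a deduplication candidate.
groups = {}
for n in ["Smith", "Smyth", "Robert", "Rupert", "Jones"]:
    groups.setdefault(soundex(n), []).append(n)

print(groups)  # {'S530': ['Smith', 'Smyth'], 'R163': ['Robert', 'Rupert'], 'J520': ['Jones']}
```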
IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.
Search before posting:)
Join the DataStagers team effort at:
http://www.worldcommunitygrid.org