data cleansing

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

harithay
Participant
Posts: 106
Joined: Tue Dec 14, 2004 10:51 pm

data cleansing

Post by harithay »

I am new to DataStage.

I know the definition of data cleansing, but how do we perform it in real-world DataStage projects? Can anybody explain with some examples?
Alexandre
Participant
Posts: 12
Joined: Fri Dec 31, 2004 5:04 am
Location: Paris

Post by Alexandre »

You don't have a data cleansing facility in DataStage; for that, take a look at QualityStage...

But in DataStage you can do "data quality" work, i.e. analyse your data to produce reports (uniqueness, counts, etc.), apply business rules (in a transformer, based on routines or using reference lookups) and eventually take decisions based on those results...
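
One way to apply such a business rule is a small server routine called from a transformer derivation. A minimal sketch in DataStage BASIC, assuming a hypothetical routine CheckProductCode whose single argument Arg1 should be exactly four digits:

* Hypothetical server routine CheckProductCode (argument: Arg1).
* Returns @TRUE when the value matches the assumed business rule
* (exactly four digits), @FALSE otherwise, so the transformer can
* route failures to a reject link.
If Arg1 Matches "4N" Then
   Ans = @TRUE
End Else
   Ans = @FALSE
End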

Anyway, it is just data transformation, like the T in ETL...
kommven
Charter Member
Posts: 125
Joined: Mon Jul 12, 2004 12:37 pm

Post by kommven »

I can give you a simple way to think about data cleansing: it's like querying your DBMS with a SELECT that names only the columns you need, so that you don't get unwanted data.

Data cleansing happens at various levels. At the finest grain it is per row, which can be achieved with a simple Transformer job; at the highest level it is at an aggregated grain, which is possible using an Aggregator stage.
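
As an illustration of row-grain cleansing, a Transformer output column derivation might look like this (InLink.CustomerName is a made-up link and column name):

* Hypothetical derivation: delete embedded tab/CR/LF characters,
* then trim leading, trailing and surplus internal blanks.
Trim(Convert(Char(9) : Char(10) : Char(13), "", InLink.CustomerName))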

Even in the source or target stage, you can specify just the columns you are interested in.

I'd appreciate it if someone commented on this.

Kommu
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Most data cleansing in DataStage is done through a transformer or a QualityStage plugin.

In a transformer, cleansing is performed through BASIC code, constraints or lookups.
- BASIC code lets you convert the data: for example trimming blanks, checking a text field for a valid date and converting it to internal date format, or removing invalid characters. Almost any cleansing business rule can be written in BASIC code (a sketch of the date check follows this list).
- Constraints let you filter invalid data and send it to a reject file. If BASIC cleansing code finds a problem, a constraint can turn it into a reject row.
- Lookups can be used to validate codes or foreign key relationships.
- QualityStage is the best option for text parsing such as breaking down a single address string, or breaking names into first name and surname. There are a huge number of cleansing functions in this product. By using the DataStage plugin you can retrieve clean or rejected records out of the QualityStage job back into the DataStage job for further processing.
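
As a sketch of the date check mentioned above (the routine name ValidateDate and its single argument Arg1 are assumptions; Iconv with the "D" conversion sets Status() to 0 on success):

* Hypothetical server routine ValidateDate (argument: Arg1).
* Converts a text date to DataStage internal format and returns ""
* on failure, so a downstream constraint can reject the row.
InternalDate = Iconv(Arg1, "D")
If Status() = 0 Then
   Ans = InternalDate
End Else
   Ans = ""
End

A transformer constraint such as ValidateDate(InLink.OrderDate) <> "" would then pass only rows with valid dates, leaving the reject link to pick up the rest.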
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

To keep it short, DataStage 'transforms' (i.e. changes) the data as per your needs. So if you already know about a data quality problem, you can leave it, rectify it or escalate it using DataStage.

But it is not a tool to 'identify' any unknown quality problem. To do that, you need some other tool such as QualityStage or plain SQL.

In the presence of low-quality data, DataStage will either return garbage or fail, depending on how the jobs are designed. Hence you may be better off performing simple quality checks in the source system periodically, to analyse any possible quality defects and confirm that the data meets your quality standards.

That is why taking periodic backups is important, in case you miss something.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Even better for identifying data problems are ProfileStage (formerly known as MetaRecon) or AuditStage (formerly known as Quality Manager).

This is exactly what those tools do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
Well, ProfileStage only gives you statistics on the actual data present in your data sources, with no options to change it.
QualityStage, on the other hand, lets you identify look-alikes through soundex and pattern matching (data values that are not equal but might represent the same thing) and standardise them; in the process you might even deduplicate multiple entries of the same actual data, even when they are not written in the same manner.
They say QualityStage delivers around 75-80% standardisation (if my memory serves me) out of the box or within a short development period (around two weeks), depending on whether you need localisation and how extensive that localisation effort turns out to be if it is not already implemented.

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting :)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Yep. I mistyped Quality Manager as QualityStage. So to rephrase: you need Quality Manager or MetaRecon.

The tool Integrity is more for reformatting any flat file to bring it to a common layout and pattern.