How to validate or QA your data warehouse

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

How to validate or QA your data warehouse

Post by kduke »

We have been going through this lately and it brought to memory a few tips I thought I would suggest. These are just some random thoughts. Maybe I will organize them later.

1. Start with the easy fields.

Sometimes the hard fields will fall out when you validate something simple. Maybe use a few temp tables to select into. Total all invoice amounts for a month. See if they match. Select the same but group by invoice number into a temp table on both source and targets. Now get a list of all invoice numbers where they do not match.

2. Validate measures first.

If you have a large fact table then validate one day by summing one column in the source against one in the fact table. Maybe do it by one dimension. So the numbers come back quickly. If you can sum one column and compare to the source then this is the easiest.

3. Validate simple to complex on dimensions.

Count records in source and target and compare then get more a more complex. If it is a SCD type 2 then count dimension records where they are still active.

4. Now calculate if the correct dimension key is in the fact table.

This is more complex if your dimensions were difficult to get into the fact table. We had Sharon the Data Queen. She probably built 25 jobs to pull in all the dimension keys. No way to join all that on the source plus to drill into the source tables was not simple. I learned a lot from Sharon.

5. Automate the simple checks in the ETL.

This builds confidence in your work by the end users. Confidence is a big plus.

Simple job:
Select count(*) from SourceTable => hash file => "S_":#TargetTable#

Second job:
Select count(*) from TargetTable => hash file => "T_":#TargetTable#

Make the key the either S_TargetTable or T_TargetTable. Now compare.

Now loop through and calculate completeness or the percent of records which made it to the warehouse. Email the results to the users.

If you always get 98% of a source then that also builds confidence. You should of justified a long time ago why the 2% did not get loaded.

6. Never assume something like a cost field comes from a cost column in the source.

On legacy home grown applications a cost maybe a negative number stored in a revenue column. The connection between what a legacy term means and what you think it means is important. Call it whatever the user is used to unless this conflicts from EDW perspective. Consistency is important but if a parent company wants it named differently then you maybe stuck with a bad name. Data Dictionary is critical to translate from what it was originally called to the new name or translate it into parent company speak. Data Dictionary should be available on line. It should be small or separated from larger documents. There are parts of data warehouse documents which are more useful and might be needed a lot until the warehouse reports become mainstream.

Please share your practical tips.
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Some more tips. If you have the source keys and a dblink back to the source then you can validate a field by select the source keys in the target table and join them to the source table where target.column <> source.column.

Very easy. Count these and then do outer joins to see if there are missing keys.
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Once you have validated the columns this way and this will only validate a straight copy of a field then row counts are very valid on a on going basis. You may need to validate this way again if you change the ETL.

These are all ideas to get you started if you are stuck or have never done this before.
Mamu Kim
Post Reply