The Importance of Estimating and Documenting

By Kim Duke and David Baldwin
www.duke-consulting.com

Outside of an occasional comment in an online article or a brief paragraph in a data warehousing book, we have found no published material on how to manage an ETL project.  A lot of documentation exists to start a data warehouse and scads of marketing material exists to sell the idea to upper management but there is little to help design, modify or maintain a good set of ETL programs (a.k.a. jobs).  In other words, NO Best Practices exist, at least not in published form. The purpose of this document is to begin exploring the many questions that you need to consider when beginning a data warehouse project and to help you outline how to implement a successful ETL project.

Be sure to check out the poll on the DSXchange surveying how community members are faring with documentation and processes used to manage the ETL software lifecycle.

Asking the Right Questions

Setting the Stage

The software lifecycle is very clearly defined in a COBOL mainframe shop. Those guys can track every penny and estimate a six month project fairly accurately. But for ETL the story is different.  If you have 300 source tables going into 30 dimension tables and 3 fact tables then how many weeks, months or years will it take? How many DataStage jobs will be required? If you have 3 DataStage developers then when should we expect to go live with this data warehouse? When you are done what kind of documentation is expected or required? Does your documentation enable those who follow you to support your work effectively?

Let’s say your standard dimension load is to extract the main source table into a sequential file. You may have more tables to join to this in which case you might choose one big join or you might load these secondary tables into hash files and look them up as you load or dump to a sequential file. Perhaps you have a max surrogate key lookup before you load the job. Do you have a document describing this standard set of jobs to load one dimension? 

Testing Questions

Software lifecycle standards say you should do several steps of testing. How do you test? Do you just unit test or do you also system test? Do you do an integrated test? Do you have a Quality Assurance person or process? Do you use AuditStage?  Do you have a documented testing procedure?

Implementation Questions

Assuming everything tested okay, now you need to implement your changes. Do you use Version Control? How do you track changes? Do you have any kind of documentation within the long descriptions of your jobs that describes these changes? Do you have outside documents which describe these changes? Do you have any kind of project tracking system to do the bare minimum change request tracking?

Maintenance Questions

Let’s say you have implemented your changes and you need to add a column to a target table. Do you have any source to target documents which describe source columns for target columns in your data warehouse tables? Do you have any process which will tell you what jobs update what tables? How often is this process or documentation updated? Is it current? Is your source to target map in a table so you can run reports against it? If someone needs to move a report off the source system and into the data warehouse can they use this table to show them how to rewrite their report?

Performance Questions

Could you produce a simple dataflow diagram which shows all the source tables and how they get transformed into the target table in the data warehouse? Do you have documents for each job? Can these be grouped together by target table? How current are these documents? How much time is involved in updating them?

Providing the Right Answers

The questions listed above are the basics of ETL software development.  We don’t have space to tackle every question here but, for the sake of demonstration, let’s focus on answering questions that will help you estimate the length of your ETL project.

An Approach to Estimating

Here’s a barebones approach to estimating:  Let’s assume that you have 300 tables going into 30 dimension tables; i.e., 10 source tables for each dimension (we’ll not concern ourselves with fact table estimates here).  Each dimension table will take a minimum of 3 jobs – an extract job, a surrogate key cross reference hash build for lookup and a load job.  (Note that this is the bare minimum.  Additional jobs might include a max surrogate key lookup, jobs to handle customer type for customer, etc.) That gives us

(3 jobs per dimension) * (30 dimension tables) = 90 jobs.

Now let’s add 9 lookup jobs for each dimension.  The math for that is

(9 lookup jobs) * (30 dimension tables) = 270 jobs.

We’ll assume you have 3 developers each working an average of 20 days per month. We’ll estimate that each job takes 8 hours or one day for a developer to produce.  Once you finish all the number crunching you’ll find that you should be able to deliver this data warehouse in 6 months.

Of course the above ‘answer’ leads to further questions.  How many jobs are in your current production environment? How many man hours did these jobs take to build? How many hours would you estimate per job? How do you know? How did you come up with this number?

The idea here is to get you thinking long term and to do what is best for your company. We have all seen massive documents from some third party to build a data warehouse for some company. And even though this upfront documentation is awesome often it bears no correlation to what the ETL developers are actually doing. Why is that? Why do these companies deliver terrible data warehouses? Where does your documentation fall? Are you organized?

Documentation Outline

  1. ETL overall design
  2. ETL naming conventions
    1. Projects
    2. Jobs
    3. Stages
    4. Links
  3. ETL load outline
    1. Number jobs (e.g., 3 per dimension)
    2. Types of jobs (extract, load, lookup …)
  4. ETL tasks
    1. Tracking system
  5. ETL estimates
    1. Number source tables
    2. Number primary keys in source
    3. Number of surrogate keys
    4. Number of columns
    5. Number of jobs
    6. Number of hours per job
  6. ETL changes
    1. Documented in word processing document
    2. Documented in the jobs
    3. Version Control used or other methodology
    4. Before and after job exports
  7. ETL admin
    1. Row counts by link sorted by run times
    2. Rows per second by link
    3. Disk space usage
    4. Table space usage
    5. Estimated rows from each source table
    6. Estimated hash file usage
  8. Metadata management
    1. System tables used
    2. Source to target map table
    3. Tables by job report
    4. Jobs by table report
  9. Job documentation
    1. Html or job design document
    2. Image of job design included
    3. Source tables
    4. Target tables
    5. Derivations described
  10. Where used documents
    1. Included in metadata management
  11. Source to target mapping table
    1. Included in metadata management
Copyright © 2005 DCS-Strategies