Outside of an occasional comment in an online article or
a brief paragraph in a data warehousing book, we have found
no published material on how to manage an ETL project.
A lot of documentation exists to start a data warehouse
and scads of marketing material exists to sell the idea
to upper management but there is little to help design,
modify or maintain a good set of ETL programs (a.k.a. jobs).
In other words, NO Best Practices exist, at least not in
published form. The purpose of this document is to begin
exploring the many questions that you need to consider when
beginning a data warehouse project and to help you outline
how to implement a successful ETL project.
Be sure to check
out the poll on the DSXchange surveying how community
members are faring with documentation and processes used
to manage the ETL software lifecycle.
Asking the Right Questions
Setting the Stage
The software lifecycle is very clearly defined in a COBOL
mainframe shop. Those guys can track every penny and estimate
a six month project fairly accurately. But for ETL the story
is different. If you have 300 source tables going into
30 dimension tables and 3 fact tables then how many weeks,
months or years will it take? How many DataStage jobs will
be required? If you have 3 DataStage developers then when
should we expect to go live with this data warehouse? When
you are done what kind of documentation is expected or required?
Does your documentation enable those who follow you to support
your work effectively?
Let’s say your standard dimension load is to extract
the main source table into a sequential file. You may have
more tables to join to this in which case you might choose
one big join or you might load these secondary tables into
hash files and look them up as you load or dump to a sequential
file. Perhaps you have a max surrogate key lookup before
you load the job. Do you have a document describing this
standard set of jobs to load one dimension?
Testing Questions
Software lifecycle standards say you should do several
steps of testing. How do you test? Do you just unit
test or do you also system test? Do you do an integrated
test? Do you have a Quality Assurance person or process?
Do you use AuditStage? Do you have a documented testing
procedure?
Implementation Questions
Assuming everything tested okay, now you need to implement
your changes. Do you use Version Control? How do you track
changes? Do you have any kind of documentation within the
long descriptions of your jobs that describes these changes?
Do you have outside documents which describe these changes?
Do you have any kind of project tracking system to do the
bare minimum change request tracking?
Maintenance Questions
Let’s say you have implemented your changes and you
need to add a column to a target table. Do you have any
source to target documents which describe source columns
for target columns in your data warehouse tables? Do you
have any process which will tell you what jobs update what
tables? How often is this process or documentation updated?
Is it current? Is your source to target map in a table so
you can run reports against it? If someone needs to move
a report off the source system and into the data warehouse
can they use this table to show them how to rewrite their
report?
Performance Questions
Could you produce a simple dataflow diagram which shows
all the source tables and how they get transformed into
the target table in the data warehouse? Do you have documents
for each job? Can these be grouped together by target table?
How current are these documents? How much time is involved
in updating them?
Providing the Right Answers
The questions listed above are the basics of ETL software
development. We don’t have space to tackle every
question here but, for the sake of demonstration, let’s
focus on answering questions that will help you estimate
the length of your ETL project.
An Approach to Estimating
Here’s a barebones approach to estimating: Let’s
assume that you have 300 tables going into 30 dimension
tables; i.e., 10 source tables for each dimension (we’ll
not concern ourselves with fact table estimates here).
Each dimension table will take a minimum of 3 jobs –
an extract job, a surrogate key cross reference hash build
for lookup and a load job. (Note that this is the bare
minimum. Additional jobs might include a max surrogate
key lookup, jobs to handle customer type for customer, etc.)
That gives us
(3 jobs per dimension) * (30 dimension
tables) = 90 jobs.
Now let’s add 9 lookup jobs for each dimension.
The math for that is
(9 lookup jobs) * (30 dimension tables)
= 270 jobs.
We’ll assume you have 3 developers each working an
average of 20 days per month. We’ll estimate that
each job takes 8 hours or one day for a developer to produce.
Once you finish all the number crunching you’ll find
that you should be able to deliver this data warehouse in
6 months.
Of course the above ‘answer’ leads to further
questions. How many jobs are in your current production
environment? How many man hours did these jobs take to build?
How many hours would you estimate per job? How do you know?
How did you come up with this number?
The idea here is to get you thinking long term and to do
what is best for your company. We have all seen massive
documents from some third party to build a data warehouse
for some company. And even though this upfront documentation
is awesome often it bears no correlation to what the ETL
developers are actually doing. Why is that? Why do these
companies deliver terrible data warehouses? Where does your
documentation fall? Are you organized?
Documentation Outline
- ETL overall design
- ETL naming conventions
- Projects
- Jobs
- Stages
- Links
- ETL load outline
- Number jobs (e.g., 3 per dimension)
- Types of jobs (extract, load, lookup …)
- ETL tasks
- Tracking system
- ETL estimates
- Number source tables
- Number primary keys in source
- Number of surrogate keys
- Number of columns
- Number of jobs
- Number of hours per job
- ETL changes
- Documented in word processing document
- Documented in the jobs
- Version Control used or other methodology
- Before and after job exports
- ETL admin
- Row counts by link sorted by run times
- Rows per second by link
- Disk space usage
- Table space usage
- Estimated rows from each source table
- Estimated hash file usage
- Metadata management
- System tables used
- Source to target map table
- Tables by job report
- Jobs by table report
- Job documentation
- Html or job design document
- Image of job design included
- Source tables
- Target tables
- Derivations described
- Where used documents
- Included in metadata management
- Source to target mapping table
- Included in metadata management