Why ETL is hard

A forum for discussing DataStage<sup>®</sup> basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Why ETL is hard

Post by datastage »

According to TDWI white paper 'Evaluating ETL and Data Integration Platforms' by Wayne Eckerson and Colin White:

"According to most practitioners, ETL design and development work consumes 60 to 80% of an entire BI project. With such an inordinate amount of resources tied up in ETL work, it behooves BI teams to optimize this layer of their BI environment.

ETL is so time consuming because it involves the unenviable task of re-integrating the enterprise's data from scratch. Over the span of many years, organizations have allowed their business processes to dis-integrate into dozens or hundreds of local processes, each managed by a single fiefdom (e.g., a department, business unit, or division) with its own systems, data, and view of the world.

With the goal of achieving a single version of the truth, business executives have appointed BI teams to re-integrate what has taken years or decades to come apart. Equipped with ETL and modeling tools, BI teams are now expected to swoop in like conquering heroes and rescue the organization from information chaos. Obviously the challenges and risks are daunting."


Anyone else have other thoughts/comments about why ETL is hard? I know I always find friction in projects where the expectation is that the ETL task is 15-20% of the project and as this statement points out it is typically 60-80%. Thus the ETL team is always blamed for causing deadlines to slip and budgets to soar. :(
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Here's my 2 cents worth:

ETL is hard because:

(1) The data warehouse team lacks an experienced and "blooded" data warehouse architect: one intimately knowledgeable in all aspects, from the software development lifecycle, to modular programming techniques, to data warehouse architectures and philosophies.
(2) The data warehouse has a model that doesn't reflect the characteristics of data warehousing; therefore, when data warehousing problems come up (atomic-level tables, slowly changing dimensions, etc.), the model can't support the standard industry-practice solution.
(3) Management has put together a team of resources who aren't familiar with data warehousing. The development team are coders; without #1 and #2 above, the team won't be able to do what is necessary.
(4) Nobody on the team understands ETL. A source table and a target table are viewed as a silo ETL development process. The holistic framework is not considered, and silo development takes place. No integrated design and framework emerges; therefore, every source-to-target ETL process is a one-off. You can't have a team of 20 ETL architects who each also do their own development.
(5) The ETL tool is viewed as either an all-encompassing solution or a slight piece of the solution. Developers play to their strengths, rather than to the tool's strengths.
(6) The data warehouse is developed outside of a software development lifecycle (SDLC, look it up, for those who don't know what I'm talking about). Versioned, planned, scheduled releases along with tried and true development practices are often chucked by the wayside.
(7) Poor source system analysis and mappings. Garbage in, garbage out. Gotta know what you got and where it's going to give the ETL any shot at success. Knowing your source data is, in my opinion, where the ETL fails most. (Second would be in the target model, sheesh, anyone EVER read Kimball or Inmon? Does a 3NF have to be the run-home-to-momma solution?)
(8) The company has been sold an ETL toolset marketed as solving #1, #2, #3, #4, #5, #6, and #7 above. It takes a highly talented team in the key roles (project manager, data architect, ETL architect), knowledgeable and experienced in data warehousing, to give the ETL a chance at resounding success on the first attempt. Most first-attempt data warehouses fail because the learning curve is soooo steep.
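Ken's #7 ("gotta know what you got") is usually operationalized as basic data profiling before any mapping is written. A minimal sketch of that idea in Python — the rows and column names here are hypothetical, and real profiling tools do far more:

```python
# Minimal source-profiling sketch: before mapping source to target, count
# nulls, distinct values, and max lengths per column. Catching dupes, nulls,
# and over-length values here is the "know what you got" step.
from collections import defaultdict

def profile(rows):
    stats = defaultdict(lambda: {"nulls": 0, "distinct": set(), "max_len": 0})
    for row in rows:
        for col, val in row.items():
            s = stats[col]
            if val is None:
                s["nulls"] += 1
            else:
                s["distinct"].add(val)
                s["max_len"] = max(s["max_len"], len(str(val)))
    return {col: {"nulls": s["nulls"],
                  "distinct": len(s["distinct"]),
                  "max_len": s["max_len"]}
            for col, s in stats.items()}

# Hypothetical source extract: a duplicated key and an over-length state code.
rows = [{"cust_id": "C1", "state": "TX"},
        {"cust_id": "C2", "state": None},
        {"cust_id": "C2", "state": "TXX"}]

p = profile(rows)
assert p["state"]["nulls"] == 1       # nullable column the model must handle
assert p["cust_id"]["distinct"] == 2  # fewer keys than rows: duplicates exist
assert p["state"]["max_len"] == 3     # value wider than the expected CHAR(2)
```

Nothing fancy, but even this much up front prevents the ETL from discovering bad keys and widths mid-load.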

Seen it, lived it, sold it. I did engagements where I retired the competition's tool with DataStage. I know an engagement where my work was retired by the competition. I know that a highly successful engagement of mine was retired because the "Total Cost of Deployment" far exceeded what the company was willing to spend for the next iteration of the warehouse. I spend a lot of my time mentoring the in-house team in data warehouse fundamentals, and why tables look the way they do and how best to load them.

Okay, that was more like 4 cents worth.

-Ken
kjanes
Participant
Posts: 144
Joined: Wed Nov 06, 2002 2:16 pm

Post by kjanes »

Ken's number 6 is a huge issue. Without a structured approach, the project will be much more challenging.

I will add three more reasons:

1) Lack of metadata. This is very significant.

2) Lack of business analysis experience.
This is essential to define the transformation of data from source to warehouse and all of the business rules in between.

3) Lack of vision.
Where does all of this lead and how are we going to use it?
Kevin Janes
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The biggie for me is "up front", getting agreement on the business metadata. In many cases I've ended up the referee in heated arguments over the actual meaning of terms used within the business!
What I now do about it - given that the arguments will always occur - is to document the meeting and its decision, and have the parties agree by signing the document.
Then, later, when it "was wrong", the document that shows their agreement is produced, and their demand for a change becomes a regular system change request (for which there's a cost, of course).
The important thing, then, is to record a reference to that document in the metadata used within the ETL tool (DataStage) so that future developers can also wave the document in "their" faces. 8)
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Re: Why ETL is hard

Post by qt_ky »

datastage wrote:Anyone else have other thoughts/comments about why ETL is hard? I know I always find friction in projects where the expectation is that the ETL task is 15-20% of the project and as this statement points out it is typically 60-80%. Thus the ETL team is always blamed for causing deadlines to slip and budgets to soar. :(
13 years later, I am seeing the same friction and same blame game in the current organization. :shock: All these points in this topic and the replies still ring true.
Choose a job you love, and you will never have to work a day in your life. - Confucius
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

I am still pretty new to it, but the sheer amount of unrelated info required is what bogs me down. I have to memorize that this table is related to that table, which is related to another table, and that this one has many of that one, blah blah. None of it is just there to see ... in my C++ career I could just right-click anything to see what it was, where it came from, and all. Here, I have to exit my tool and open something else or, worse, find a person who knows a person who knows where to go and who to ask.

It's not the ETL. It's the knowledge required about each specific database before you can even start the ETL that is slow, from what I have seen so far.
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

First, hat tip to qt_ky for bringing this up to the top of the list.

I find myself bogged down in #5, the all-or-just-a-bit false dichotomy. I started life in IT hip-deep in mainframe system development, COBOL and JCL, and no one at any point looked at it in that light. It was a tool on the list, and planning regarded it as such and made effective decisions about its proportion of use in any given project.

I've tried to bring expectations to a reasonable middle point, but it too often gets ignored. Either DataStage is tapped because it's a rapid application development tool -- because, you know, doing it faster is always good :P -- or it gets overlooked because from the outside it looks both weird and difficult.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Great topic to bring back. I miss Ken Bland. He always added a level of deep thought. I still see developers using DataStage poorly. Their SQL takes forever to come back, which tells me they did very little testing. It also means they are probably doing too much in SQL. Doing too many joins in the source SQL is a big issue, especially with outer joins. How someone removes dupes usually tells you whether they understand partitioning.
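The point about dupes and partitioning can be sketched outside any particular tool. A hedged Python illustration (the key name is made up): deduplicating within partitions removes all duplicates only if the data was partitioned on the dedup key, so rows with the same key meet in the same partition.

```python
# Sketch: per-partition dedup succeeds or fails depending on partitioning.

def hash_partition(rows, key, n_parts):
    """Hash-partition rows so all rows sharing a key land in one partition."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def dedup_within(parts, key):
    """Keep only the first row per key inside each partition."""
    out = []
    for part in parts:
        seen = set()
        for row in part:
            if row[key] not in seen:
                seen.add(row[key])
                out.append(row)
    return out

rows = [{"customer_id": 1, "v": "a"},
        {"customer_id": 1, "v": "b"},
        {"customer_id": 2, "v": "c"}]

# Hash-partitioned on the dedup key: duplicates collide and are removed.
good = dedup_within(hash_partition(rows, "customer_id", 4), "customer_id")
assert len(good) == 2

# Round-robin-style split that ignores the key: the two customer_id=1 rows
# land in different partitions and both survive a per-partition dedup.
bad = dedup_within([rows[0::2], rows[1::2]], "customer_id")
assert len(bad) == 3
```

The same logic is why a remove-duplicates step run on key-ignorant partitioning quietly lets dupes through.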

I think a lot of bad habits are formed when learning Informatica first. From what I can tell, Informatica cannot survive without staging tables. Those developers also tend to use SQL for joins and transforming. They are afraid of nulls, so they scrub them out in SQL. Very poor.

I still see poor performance on almost every point of Ken's. Poor analysis leads to poor design, which leads to poor mapping, which leads to poor implementation. When you build a house on a poor foundation, the house will not be strong.

I think it starts with a poor attitude. If you set out to build a center of excellence, then you start out with an attitude of excellence. The list above is just the basics. How can you do the more complicated things if you cannot do the basics? Metadata, governance, and sanity checks are all more advanced topics. Companies missing these three probably have poorly implemented data warehouses. It takes depth of thought and character to build these last three. You can tell from our politicians that we have a serious lack of character in America.
Mamu Kim
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Great thread....indeed, kudos for bringing it back. There is significant discussion all around our industry regarding "ETL" (or even just "data integration") and its evolution alongside current trends such as "data lake", "cloud", and methodologies for the kind of data (how clean?, etc.) one puts into places like Hadoop. Some of it is just mild bending and twisting of our long-standing definitions and roles; in other cases, it is a wholesale turning of the data integration/cleansing/repair/aggregation/consolidation (+more) paradigm on its ear --- putting transformation activity and data repair/correction in the hands of near-business users. Anyone out there seeing or hearing from customers about tooling such as Astera, EasyMorph, or even Tableau, for pushing transformation and ETL-like functionality closer to the business?

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

kduke wrote:I think a lot of bad habits are made when learning Informatica first. From what I can tell Informatica cannot survive without staging tables. They also tend to use SQL for joins and transforming. These developers are afraid of nulls so they scrub them out in SQL. Very poor.
Hey, I resemble that remark! Or maybe the inverse of it, can't believe I haven't touched DataStage since 2009 and I'm still here exposing my growing ignorance day after day.

There actually isn't a whole lot of difference between Informatica and DataStage IMHO, the former being akin to a hybrid of the two flavors of the latter in my experience, which helps isolate the Infa folks from much of the trials and tribulations of the PX side of your world. You don't need staging tables there any more than you would here. The comment about nulls is interesting to me, as Infa developers are actually shielded from them to a great extent, such that they don't have to worry about them as much as they should. Does this null equal that null? Why, yes - yes it does! And the comment about SQL is something I've seen since I started and still do to this day, regardless of tool: some folks think they are SQL Wizards and do every last bit of the transformations in the source query, no matter how unreadable or fugly or badly crafted it might be - and then the job just streams the results to the target. I've inherited a bunch of stuff like that here and it maketh me crazy. Never mind the young pups who think they already know everything there is to know about ETL when in reality they know next to nothing, as they honestly haven't been blooded yet. <sigh>
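The "does this null equal that null" gap is worth spelling out, since it bites regardless of tool. A hedged Python sketch (illustrative only, not any specific tool's semantics): SQL comparisons are three-valued, so NULL = NULL yields UNKNOWN and a WHERE clause or join drops the row, while an expression layer that shields developers from nulls typically makes null == null simply true.

```python
# Sketch of the null-comparison gap: three-valued SQL-style equality vs. a
# two-valued expression layer. SQL_NULL is modeled with Python's None.

SQL_NULL = None

def sql_equals(a, b):
    """SQL-style comparison: any NULL operand yields UNKNOWN (None)."""
    if a is SQL_NULL or b is SQL_NULL:
        return None  # UNKNOWN: a WHERE clause treats this as "not true"
    return a == b

def tool_equals(a, b):
    """Two-valued comparison, as a null-shielded expression layer might do."""
    return a == b    # in Python, None == None is True

assert sql_equals(SQL_NULL, SQL_NULL) is None  # row filtered out in SQL
assert tool_equals(SQL_NULL, SQL_NULL) is True # "this null equals that null"
assert sql_equals(1, 1) is True                # non-null values behave alike
```

A developer trained only against the second semantics never learns to handle the first, which is exactly the "shielded from nulls" problem above.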

I for one would love to hear answers to Ernie's questions, as that's something I've been wondering about for some time now. Haven't seen it, though, being pretty well isolated in the 'traditional' tool space.

And I miss Ken too, he always had an opinion and a unique wealth of knowledge to share. :(
-craig

"You can never have too many knives" -- Logan Nine Fingers
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Awesome answer, Craig. I too wish Ernie would start a new topic, or use this one, to expand on what these new tools are trying to address and solve -- what questions are coming that we had no idea were even issues. Please educate us, Ernie, on what is out there. If there are cool concepts or tools being discussed, then I would love to hear about it. I think we are blindsided by these new concepts sometimes.

I apologize if I come off harsh, but I think just because something works does not mean it is a good solution. Whatever happened to continuous improvement? I guess I am old school and think I am always learning and have not arrived. Some kid straight out of school thinking he has all the answers, with people following his advice, is really scary. The more knowledge I have, the more I realize how little I really know and understand. I used to be one of those scary kids. Guys like Ray and Ken taught me how little I really knew.
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

And I too am always learning and have not arrived.

I invest considerable quantities of my own time in doing so, because there's only so much you can learn from courses (and my manager says I no longer need them).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

And I wonder, often out loud, how it is that someone in their early twenties can have a title including the word "Senior".
:roll:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

eostic wrote:Anyone out there seeing or hearing from customers about tooling such as Astera, EasyMorph, or even Tableau, for pushing transformation and ETL-like functionality closer to the business?
Starting to see Tableau trying to push into "my" space. Nowhere near the richness of functionality, of course, but their pushback is that they only have a single source, so why not?

(They have no idea how that single source - a data warehouse - got populated with trustable data!)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.