Why landing data is good for warehouse batch ETL

ray.wurlod · Post by **ray.wurlod** » Sat Nov 22, 2003 12:53 am

To summarize:

Get Red Brick Warehouse! World's best bulk loader, and automatic maintenance of aggregates.

(Is Aggre Gates the name of Bill's next child?)

Teej · Post by **Teej** » Sat Nov 22, 2003 4:46 pm

*reads*

*reads*

*head explodes*

*attempt to read some more.*

* * *

"What would happens if we secretly replace his brain with genuine KBA-enabled Folgers Crystals? Lets watch..."

"Boy, I feel very fresh!"

* * *

Okay, enough Saturday Silliness -- seriously, this is something that is very difficult to understand without knowing the correct definitions of the words being used here. For example, your definition of a Sandbox is entirely different than my own. So please tell me:

What is an aggregate?

-T.J.

kcbland · Post by **kcbland** » Sat Nov 22, 2003 5:29 pm

How about an example:

Base fact table, say order_transactions. The simple structure is something like:

Code: Select all

order_nbr <pk>
order_line_nbr <pk>
customer_key <fk>
product_key <fk>
order_dt_key <fk>
order_ts
transaction_type
qty
price_amt
ext_price_amt

OLAP is about analyzing volumes of transactions, not individual transactions. Most queries involve rollups of some nature. An aggregate is usually three types. Read this:

http://www.intelligententerprise.com/02 ... e1_1.shtml

Okay, so, a typical analysis or report may be to show totals summed and grouped by order date and product. The fastest way to display those results is usually not to churn through all of the base facts, but to go to a table that has already summed that information. If orders by date by product is a common base for analysis, then having everyone and every report re-summing the base data constantly should ring some bells that that information should be pre-computed. Voila, an aggregate table is born.

This allows fantastic possibilities. Let's say someone runs a query off the base fact table grouping by product and order date. Oracle has query rewrite capability, which means if it sees that your query can be satisfied via an aggregate table, it will courteously rewrite the query and run it against the aggregate.

Let me give you a real world example from a client of mine. Their base orders fact table has about 800 million rows of data. If every report were to have to sift and sum through that table, the reports would never finish. We pre-summarize the data to a level that accommodates all reports. Most financial and forecast reporting is customer irrelevant, so disposing of that along with the order number and line number vastly collapses the data. Instead of 800 million rows you end up with about 30 million rows. This is much more easier to scream through. In addition, an average month has 8 million transactions, but rolled to this level you're down to 500,000.

So, the ETL handles loading the base fact table, as well as computing the incremental update to the aggregates (actually, 9 "base" aggregates, all different permutations). This added 20 minutes total for all 9 to a 90 minute jobstream. It's elegant, simple, and all part of a planned architecture.

Teej · Post by **Teej** » Sat Nov 22, 2003 10:41 pm

kcbland wrote:OLAP is about analyzing volumes of transactions, not individual transactions. Most queries involve rollups of some nature. An aggregate is usually three types. Read this:

http://www.intelligententerprise.com/02 ... e1_1.shtml

This is EXACTLY why I asked. My understanding of the aggregate is limited to what others have used it as -- an action, not as a concept.

With this, I am starting to understand your innate criticsm against PX -- the lack of ability to land the data in a form that can be further manipulated. I was going to suggest Datasets, but now I understand -- a dataset simply is 'as is' kind of situation, not as a tool that can be commanded to spit out limited set of records -- a local database in another sense. Honestly, the use of the word "Hash" can mean so many different and conflicting things.

That article actually was difficult to read due to the lack of texts on those diagrams (and especially since I have such little understanding on the higher-level database design). In fact, as I am sitting here thinking about those three designs, I am realizing that the vast majority of the design team at my company does NOT get it.

That is why we are having such great difficulty with the 5 terrabyte database that we are dealing with right now. Our focus are so much on the optimizing on the current process, not rethinking of better aggregates. In fact, I offered that as a solution to my design team for the Feed programs (which happens to use the same set of data, only slightly tweaked), and it took them a while to understood how it could actually work, even with the limitation of datasets.

Holy crap, this project is going to continue to crash and burn until it is optimized in the design phase. We will have to refactor so many different things that it will definitely upset the higher-up management if they ever find out that their pet project started out on the wrong foot for the past 16 months.

So much to learn... Oh, so much to learn.

Damn, I'm excited all over again! I have been missing this feeling for months!

-T.J.

tomengers · Post by **tomengers** » Mon Nov 24, 2003 12:44 pm

Ken ...

Really appreciate the time you put into this ... this is wonderful stuff and your insights are valued ...

... tom

roy · Post by **roy** » Tue Nov 25, 2003 2:41 am

Hi yall,
I agree with Tom Ken's got some good stuff for plenty of people here.
Besides he's probably saving for those 25,000 points here (j/k)

keep up the great job Ken
And Thanks in the name of all of us you help

(you 2 Ray)

ray.wurlod · Post by **ray.wurlod** » Fri Jul 23, 2004 5:43 pm

Only my solution was more flexible and easier to implement than Red Brick's

Not only does Red Brick automatically maintain its aggregates, it automatically rewrites SQL executed against the base table in order to use the precomputed results stored in the aggregates, where appropriate.

Bet your "solution" doesn't do this!

kduke · Post by **kduke** » Fri Jul 23, 2004 6:54 pm

Red Brick is screamin fast on loads.

peternolan9 · Post by **peternolan9** » Mon Jul 26, 2004 9:58 am

Hi Ray,
and in red-brick you must create a table for each level and you must define the RI so that the loader knows how to consolidate during the load process etc, etc. Still a fine database. But most days I work on Oracle...

Not my choice...

My point was just that ETL processes/software I've been writing since 1994 allows you to do this on any database without all the extra tables and without the need for an IT person to create tables, tablespaces, indexes, RI constraints etc which cannot be created by the average user. (Not that would you give your average user DBA access..

) . Sure, if you want to do all those things, the ETL processes/Software I've written will do that too...

Query re-write has been an interesting discussion.....about 6 years ago Brio/MicroStrategy started to do some query re-write and everyone agreed it should be in the database and Oracle said they would put it into the database.........then Oracle prduced Materialised Views which I feel is about the poorest effort at query re-write I've ever seen...Microstrategy did a great job of query re-write, and BO does a reasonable job of query re-write if you can just be smart enough to know how to build the universe to make it automatic....

Obviously query re-write is not part of an ETL tool...

And, as far as I'm aware, my software would work with Red-Brick as well, I've just never tried it...it has worked with all the databases pretty much unchanged except for Sybase IQ which wants to call a 23 char timestamps 6 chars!!! Also, the databases like to return different codes for a constraint clash on insert......if a redbrick site wants to use it, i'd be interested in giving it a test.....so far I've tested Oracle, DB2 UDB, SQL Server 2000, Sybase ASE/IQ....which is probably enough to get started with....

ray.wurlod wrote:
Only my solution was more flexible and easier to implement than Red Brick's
Not only does Red Brick automatically maintain its aggregates, it automatically rewrites SQL executed against the base table in order to use the precomputed results stored in the aggregates, where appropriate.
Bet your "solution" doesn't do this!

marc_brown98 · Post by **marc_brown98** » Mon Jul 26, 2004 11:56 am

Peter,
Fine work. I also agree query re-write should be enabled at the DB level, let it figure out what tables to use to resolve the query instead of the developer or front end tool, that requires synchronization of some technical meta-data. I actually believe redbrick's value proposal has little to do with agg. awareness and more to do with manageability & performance. Years ago when I was doing exclusive RBW work, I had a chance to see an independent bake off of load performance between all of the heavyweigt DBMS's. RedBrick stomped all of them and they did it with 1 redbrick person there in 3 days vs. an army of dbms vendor people there for weeks. Just my take. The technology has changed since then.

peternolan9 · Post by **peternolan9** » Thu Jul 29, 2004 2:33 am

Hi Marc,
Thank You. I am very proud of my current release of ETL software....and it's started selling well now...... :D

I'm sure the 'army of DBAs' were Oracle...it's just so time consuming to manage....especially with raw devices.....

RB did a fine job of loading DWs......and I felt the loading was a great feature.....but the world uses oracle in most cases so we need to make oracle do what we want it to do no matter how hard that is...

Just by the way, IQ is even better than RB in loading and faster at query and requires almost no DBA time.....it's a 'better mousetrap'....when I first used IQ I complained bitterly about the lack of DBA tools....it was not until we built the production database I realised the lack of DBA tools was because they were not needed!!!!

marc_brown98 wrote:Peter,
Fine work. I also agree query re-write should be enabled at the DB level, let it figure out what tables to use to resolve the query instead of the developer or front end tool, that requires synchronization of some technical meta-data. I actually believe redbrick's value proposal has little to do with agg. awareness and more to do with manageability & performance. Years ago when I was doing exclusive RBW work, I had a chance to see an independent bake off of load performance between all of the heavyweigt DBMS's. RedBrick stomped all of them and they did it with 1 redbrick person there in 3 days vs. an army of dbms vendor people there for weeks. Just my take. The technology has changed since then.

denzilsyb · Post by **denzilsyb** » Thu Jul 29, 2004 3:14 am

ray.wurlod wrote: Get Red Brick Warehouse! World's best bulk loader, and automatic maintenance of aggregates.

Ray, I'm interested - Do you have a site with performance stats on the bulk load? I'd like to compare it to Sybase IQ bulk loading.

ray.wurlod · Post by **ray.wurlod** » Thu Jul 29, 2004 4:23 am

Not working at any Red Brick site at the moment. If it's urgent, you might be able to get some from IBM (the new owners of Red Brick, but with no idea what a star is in their stable - after all, it's not DB2!)

There are many variables that would have to be standardised in such a benchmark, so that apples could be compared validly with apples. But the Red Brick loader really is slick!

denzilsyb · Post by **denzilsyb** » Thu Jul 29, 2004 4:43 am

It will be good to see, and I'll go check out IBMs site - but you should see Sybase IQ at bulk loads.. man, its impressive. Ive seen it at 100 000 rows per second.

and then the queries are just as fast. I once was blind, but now can see

and then theres this... EVERY column is indexed

peternolan9 · Post by **peternolan9** » Thu Jul 29, 2004 5:54 am

But does Red Brick run on Oracle?

(LOL)

denzilsyb wrote:
ray.wurlod wrote: Get Red Brick Warehouse! World's best bulk loader, and automatic maintenance of aggregates.
Ray, I'm interested - Do you have a site with performance stats on the bulk load? I'd like to compare it to Sybase IQ bulk loading.

DSXchange

Why landing data is good for warehouse batch ETL

Why Red Brick Warehouse remains a good choice

Re: Why Red Brick Warehouse remains a good choice