Why landing data is good for warehouse batch ETL

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

To summarize:

Get Red Brick Warehouse! World's best bulk loader, and automatic maintenance of aggregates.

(Is Aggre Gates the name of Bill's next child?) :lol:
Last edited by ray.wurlod on Sat Nov 22, 2003 7:50 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

*reads*

*reads*

*head explodes*

*attempt to read some more.*

* * *

"What would happens if we secretly replace his brain with genuine KBA-enabled Folgers Crystals? Lets watch..."

"Boy, I feel very fresh!"

* * *

Okay, enough Saturday Silliness -- seriously, this is something that is very difficult to understand without knowing the correct definitions of the words being used here. For example, your definition of a Sandbox is entirely different than my own. So please tell me:

What is an aggregate?

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

How about an example:

Base fact table, say order_transactions. The simple structure is something like:

Code: Select all

order_nbr <pk>
order_line_nbr <pk>
customer_key <fk>
product_key <fk>
order_dt_key <fk>
order_ts
transaction_type
qty
price_amt
ext_price_amt
OLAP is about analyzing volumes of transactions, not individual transactions. Most queries involve rollups of some nature. An aggregate is usually three types. Read this:

http://www.intelligententerprise.com/02 ... e1_1.shtml

Okay, so, a typical analysis or report may be to show totals summed and grouped by order date and product. The fastest way to display those results is usually not to churn through all of the base facts, but to go to a table that has already summed that information. If orders by date by product is a common base for analysis, then having everyone and every report re-summing the base data constantly should ring some bells that that information should be pre-computed. Voila, an aggregate table is born.

This allows fantastic possibilities. Let's say someone runs a query off the base fact table grouping by product and order date. Oracle has query rewrite capability, which means if it sees that your query can be satisfied via an aggregate table, it will courteously rewrite the query and run it against the aggregate.

Let me give you a real world example from a client of mine. Their base orders fact table has about 800 million rows of data. If every report were to have to sift and sum through that table, the reports would never finish. We pre-summarize the data to a level that accommodates all reports. Most financial and forecast reporting is customer irrelevant, so disposing of that along with the order number and line number vastly collapses the data. Instead of 800 million rows you end up with about 30 million rows. This is much more easier to scream through. In addition, an average month has 8 million transactions, but rolled to this level you're down to 500,000.

So, the ETL handles loading the base fact table, as well as computing the incremental update to the aggregates (actually, 9 "base" aggregates, all different permutations). This added 20 minutes total for all 9 to a 90 minute jobstream. It's elegant, simple, and all part of a planned architecture.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

kcbland wrote:OLAP is about analyzing volumes of transactions, not individual transactions. Most queries involve rollups of some nature. An aggregate is usually three types. Read this:

http://www.intelligententerprise.com/02 ... e1_1.shtml
This is EXACTLY why I asked. My understanding of the aggregate is limited to what others have used it as -- an action, not as a concept.

With this, I am starting to understand your innate criticsm against PX -- the lack of ability to land the data in a form that can be further manipulated. I was going to suggest Datasets, but now I understand -- a dataset simply is 'as is' kind of situation, not as a tool that can be commanded to spit out limited set of records -- a local database in another sense. Honestly, the use of the word "Hash" can mean so many different and conflicting things.

That article actually was difficult to read due to the lack of texts on those diagrams (and especially since I have such little understanding on the higher-level database design). In fact, as I am sitting here thinking about those three designs, I am realizing that the vast majority of the design team at my company does NOT get it.

That is why we are having such great difficulty with the 5 terrabyte database that we are dealing with right now. Our focus are so much on the optimizing on the current process, not rethinking of better aggregates. In fact, I offered that as a solution to my design team for the Feed programs (which happens to use the same set of data, only slightly tweaked), and it took them a while to understood how it could actually work, even with the limitation of datasets.

Holy crap, this project is going to continue to crash and burn until it is optimized in the design phase. We will have to refactor so many different things that it will definitely upset the higher-up management if they ever find out that their pet project started out on the wrong foot for the past 16 months.

So much to learn... Oh, so much to learn.

Damn, I'm excited all over again! I have been missing this feeling for months!

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
tomengers
Participant
Posts: 167
Joined: Tue Nov 19, 2002 12:20 pm
Location: Key West

Post by tomengers »

Ken ...

Really appreciate the time you put into this ... this is wonderful stuff and your insights are valued ...

... tom
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi yall,
I agree with Tom Ken's got some good stuff for plenty of people here.
Besides he's probably saving for those 25,000 points here (j/k) :lol:

keep up the great job Ken
And Thanks in the name of all of us you help :)
(you 2 Ray)
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
Image
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Why Red Brick Warehouse remains a good choice

Post by ray.wurlod »

Only my solution was more flexible and easier to implement than Red Brick's
Not only does Red Brick automatically maintain its aggregates, it automatically rewrites SQL executed against the base table in order to use the precomputed results stored in the aggregates, where appropriate. 8)
Bet your "solution" doesn't do this!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Red Brick is screamin fast on loads.
Mamu Kim
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Re: Why Red Brick Warehouse remains a good choice

Post by peternolan9 »

Hi Ray,
and in red-brick you must create a table for each level and you must define the RI so that the loader knows how to consolidate during the load process etc, etc. Still a fine database. But most days I work on Oracle... :shock: Not my choice... :x


My point was just that ETL processes/software I've been writing since 1994 allows you to do this on any database without all the extra tables and without the need for an IT person to create tables, tablespaces, indexes, RI constraints etc which cannot be created by the average user. (Not that would you give your average user DBA access.. ;-) ) . Sure, if you want to do all those things, the ETL processes/Software I've written will do that too...

Query re-write has been an interesting discussion.....about 6 years ago Brio/MicroStrategy started to do some query re-write and everyone agreed it should be in the database and Oracle said they would put it into the database.........then Oracle prduced Materialised Views which I feel is about the poorest effort at query re-write I've ever seen...Microstrategy did a great job of query re-write, and BO does a reasonable job of query re-write if you can just be smart enough to know how to build the universe to make it automatic....

Obviously query re-write is not part of an ETL tool... ;-)

And, as far as I'm aware, my software would work with Red-Brick as well, I've just never tried it...it has worked with all the databases pretty much unchanged except for Sybase IQ which wants to call a 23 char timestamps 6 chars!!! Also, the databases like to return different codes for a constraint clash on insert......if a redbrick site wants to use it, i'd be interested in giving it a test.....so far I've tested Oracle, DB2 UDB, SQL Server 2000, Sybase ASE/IQ....which is probably enough to get started with....

ray.wurlod wrote:
Only my solution was more flexible and easier to implement than Red Brick's
Not only does Red Brick automatically maintain its aggregates, it automatically rewrites SQL executed against the base table in order to use the precomputed results stored in the aggregates, where appropriate. 8)
Bet your "solution" doesn't do this!
Best Regards
Peter Nolan
www.peternolan.com
marc_brown98
Premium Member
Premium Member
Posts: 67
Joined: Wed Apr 14, 2004 11:33 am

Post by marc_brown98 »

Peter,
Fine work. I also agree query re-write should be enabled at the DB level, let it figure out what tables to use to resolve the query instead of the developer or front end tool, that requires synchronization of some technical meta-data. I actually believe redbrick's value proposal has little to do with agg. awareness and more to do with manageability & performance. Years ago when I was doing exclusive RBW work, I had a chance to see an independent bake off of load performance between all of the heavyweigt DBMS's. RedBrick stomped all of them and they did it with 1 redbrick person there in 3 days vs. an army of dbms vendor people there for weeks. Just my take. The technology has changed since then.
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Post by peternolan9 »

Hi Marc,
Thank You. I am very proud of my current release of ETL software....and it's started selling well now...... :D

I'm sure the 'army of DBAs' were Oracle...it's just so time consuming to manage....especially with raw devices.....

RB did a fine job of loading DWs......and I felt the loading was a great feature.....but the world uses oracle in most cases so we need to make oracle do what we want it to do no matter how hard that is...

Just by the way, IQ is even better than RB in loading and faster at query and requires almost no DBA time.....it's a 'better mousetrap'....when I first used IQ I complained bitterly about the lack of DBA tools....it was not until we built the production database I realised the lack of DBA tools was because they were not needed!!!!

marc_brown98 wrote:Peter,
Fine work. I also agree query re-write should be enabled at the DB level, let it figure out what tables to use to resolve the query instead of the developer or front end tool, that requires synchronization of some technical meta-data. I actually believe redbrick's value proposal has little to do with agg. awareness and more to do with manageability & performance. Years ago when I was doing exclusive RBW work, I had a chance to see an independent bake off of load performance between all of the heavyweigt DBMS's. RedBrick stomped all of them and they did it with 1 redbrick person there in 3 days vs. an army of dbms vendor people there for weeks. Just my take. The technology has changed since then.
Best Regards
Peter Nolan
www.peternolan.com
denzilsyb
Participant
Posts: 186
Joined: Mon Sep 22, 2003 7:38 am
Location: South Africa
Contact:

Post by denzilsyb »

ray.wurlod wrote: Get Red Brick Warehouse! World's best bulk loader, and automatic maintenance of aggregates.
Ray, I'm interested - Do you have a site with performance stats on the bulk load? I'd like to compare it to Sybase IQ bulk loading.
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not working at any Red Brick site at the moment. If it's urgent, you might be able to get some from IBM (the new owners of Red Brick, but with no idea what a star is in their stable - after all, it's not DB2!) :roll:

There are many variables that would have to be standardised in such a benchmark, so that apples could be compared validly with apples. But the Red Brick loader really is slick!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
denzilsyb
Participant
Posts: 186
Joined: Mon Sep 22, 2003 7:38 am
Location: South Africa
Contact:

Post by denzilsyb »

It will be good to see, and I'll go check out IBMs site - but you should see Sybase IQ at bulk loads.. man, its impressive. Ive seen it at 100 000 rows per second. :shock: and then the queries are just as fast. I once was blind, but now can see

:!: and then theres this... EVERY column is indexed :!:
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Post by peternolan9 »

But does Red Brick run on Oracle? ;-) (LOL)
denzilsyb wrote:
ray.wurlod wrote: Get Red Brick Warehouse! World's best bulk loader, and automatic maintenance of aggregates.
Ray, I'm interested - Do you have a site with performance stats on the bulk load? I'd like to compare it to Sybase IQ bulk loading.
Best Regards
Peter Nolan
www.peternolan.com
Post Reply