MDM data structures (suggested enhancements)

Infosphere Master Data Management theory and best practices

Moderators: chulett, rschirm, falsehate

Post Reply
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am

MDM data structures (suggested enhancements)

Post by kduke »

Why are most of the MDM topics about something other than MDM?

Anyway, I think most the tables in MDM are designed poorly and cause performance issues. I wanted to discuss this for a while now. MDM lumps a bunch of objects together in what they call parties. Parties are people or organizations mostly. I think they call the table something other than PARTY.

There are links between parties stored in tables ending REL for relationship. These REL tables often describe or store tree structures. A tree structure looks like a Christmas tree. It has one top or parent and many children and grandchildren. An Org chart is a tree structure where the top is the chairman or CEO. All other employees report to him or someone who reports to him or her.

The REL tables store party ids. If this relationship disappears then the party is end dated. Not in the REL table but in the PARTY table. This is where I think these tables are designed wrong.
Last edited by kduke on Mon Jan 06, 2014 3:23 pm, edited 1 time in total.
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am

Post by kduke »

I am breaking this up into chunks because it times out on me otherwise.

So next lets look at the REL tables as fact tables. Lets look at what we have to do to display the current Org chart when using this data structure.

Find the top:

Code: Select all

Select 
   P.LAST_NAME,
   P.FIRST_NAME
From
   PARTY P
Inner Join
   ORG_TREE_REL O
On
   P.PARTY_ID = O.CHILD_PARTY_ID
   And P.END_DATE >= CURRENT_DATE
Where
   O.PARENT_PARTY_ID is null


The top of the tree is where it does not have a parent or parent is null. Normally the SQL on these tables is only concerned with whether END_DATE is null or not. Usually not important to compare to today's date.
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am

Post by kduke »

Why is this wrong or poorly designed? If all your attributes are similar or work together then eliminate things like type 2 dimensions or snowflaking. When your data structures are clean and all the attributes work together then the ETL loads faster and everything runs smoother. Type 3 dimensions and bridge tables are for morons that do not understand BI or star schemas. When you need to see a column value change by date is because you get some emotional connection to the data. Very seldom does the old value help make a business decision. BI should help make understand your business by looking at facts. Reports against dimensions are silly. For example list all customer's current addresses is NOT a BI report. None the less we have all done these reports for managers who could care less whether it was a BI report or not. Rant over back to design.

The end date is not an attribute of party. It is an attribute of the relationship between 2 parties. Why is this important? The fewer tables joined together to derive some function of business then the faster the SQL runs. If I can do all my work in one table then I can easily add indexes to optimize. Just to find the top of the tree we always have to join PARTY to the REL table in the above example. This is just the top of the tree. Next we need to join parents and children to the REL table to pull in each level below the top. So a minimum of 3 tables for each of these REL tables.

There are 400 tables in MDM and these REL tables store lots of things like the relationship of parties to there addresses. These addresses are grouped together to track primary, home, business and other types of addresses. REL tables on top of REL tables gets complicated real quick.
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am

Post by kduke »

Simple fix. Move END_DATE to REL tables. I will rename these to FACT tables so you can see how easily this works.

Find the top:

Code: Select all

Select 
   P.LAST_NAME,
   P.FIRST_NAME
From
   PARTY P
Inner Join
   ORG_TREE_FACT O
On
   P.PARTY_ID = O.CHILD_PARTY_ID
   And O.END_DATE >= CURRENT_DATE
Where
   O.PARENT_PARTY_ID is null



Now because the ON clause is based on fields in Org tree then an index can optimize this and it can run really fast. If you look at the SQL for the next level down then it also because cleaner and easily optimized.

END_DATE is an attribute of the relationship. It describes when the relationship became invalid. It needs to live in the right table.

No telling how many degrees the guy has that designed these tables. I am sure he has a great response for all this but common sense and years of experience tell me I am right and this should be fixed. The speed of MDM will greatly increase with this simple fix.

I apologize if I offended anybody. Everybody using MDM will benefit from these changes. All you ETL folks need to understand data structures and the impact they have on performance. I plan on posting something similar on metadata tools.
Mamu Kim
Post Reply