Page 1 of 1

ETL date range between join

Posted: Fri Aug 27, 2004 11:46 am
by kduke
Currently we have employees assigned to a team to service a customer. There is also another team of 3 assigned to manage their planning, buying and general questions. This second team gets a bonus based on their role. Example:

BuyRep 50% for 2003
PlanRep 25%
GenRep 25%

Lets say our customer is Ford Motor. So from Jan to May they had reps 1,2,3 after that they had 4,2,3. At some point you need to join this to a fact table at a monthly level or granularity. Reps 1,2 have always been a part of the New York team. Rep 3 switched from NY to the LA team. Rep 4 has always been in the LA team.

Code: Select all

SalesTeam
Rep Team StartYearMonth EndYearMonth SalesTeamNo
1     NY     200101             Null                111
2     NY     200301             Null                111
3     NY     200202             200306           111
3     LA     200307             Null                111
4     LA     200101             Null                111

RepTeam
Cust  Rep StartYearMonth EndYearMonth RepTeamNo RepType
Ford  1     200301             200305           321            Buy
Ford  4     200306             200312           321            Buy
Ford  2     200301             200312           321            Plan
Ford  3     200301             200312           321            Gen
 


How would you combine all this to a single dimension table? How would you look this key up as the fact loads?

I will post our solution in a few days. Basically you have a between join. They are super slow. There are 100,000 rows in each source table so the combination is in the millions.

Posted: Fri Aug 27, 2004 1:04 pm
by chulett
Is this like... a quiz, Kim? :lol:

Posted: Fri Aug 27, 2004 3:01 pm
by kduke
Sure why not. I think a lot of people have these issues and maybe they have a better solution. I think what we did is kind of cool.

Anyway here is what we did. We had a day dimension table. You could easily create one. You can join both sides to this table and create a record for each month. That way the fact table is also easy to create because you know what month each fact happened. To join these you need to check several possibilities otherwise. Your between join is super complex.

Join possibilites for valid dates:
1. First table's start date is > second start
and first end date is > second end date
(Start1 < Start2 and End1 > End2)
2. Start1 < Start2 and End1 < End2 and Start1 > End2
3. Start1 < Start2 and End1 < End2 and Start2 < End1
4. Start1 > Start2 and End1 < End2
5. Start1 > Start2 and End1 > End2 and Start1 < End2
6. Start1 > Start2 and End1 > End2 and Start2 > End1

If you ever done date range lookups then this is an elogant solution.

Code: Select all

select
  whatever
from
   Table1,
   Table2,
   MonthTable
where
   MonthTable.YearMonth between Table1.StartYearMonth and Table1.EndYearMonth
and
   MonthTable.YearMonth between Table2.StartYearMonth and Table2.EndYearMonth



[/code]

Posted: Sat Aug 28, 2004 9:01 am
by kduke
Not bad but how do you build the multivalue hash file within a job without writing the whole thing in BASIC. Ken has a trick to load it comma delimited. You need the dates sorted descending for the locate to be fast.

The next problem is still have not joined these 2 together. I am not sure what you mean by "parallel". Is this another term for associated multivalues. There are really 2 sets of multivalues here. You still have to create the concept of a team in a dimension which is a unique grouping of reps across both sets of dates.

This really represents 2 dimension tables maybe 3 or 4. One for sales team. Another for sales reps with their associate percentage and role (but, plan, gen). The percentage changes by date. This dimension may need to be snowflaked by customer. You still need a dimension for the individuals.

I agree you created a super fast lookup but I need to know the distinct combinations of individuals or Reps which make up this group.

I want to know can or has anyone aggregated in a straight job without a DataStage aggregation stage to a multivalued hash file. Will it let you do it.

Say we sorted SalesTeam by Rep and Date. We look up Team and Rep. If not found then StartYearMonth and EndYearMonth are just passed through. If we have loaded this record already then it becomes:

Code: Select all

StartYearMonth = Lookup.StartYearMonth :@VM: InLink.StartYearMonth
EndYearMonth = Lookup.EndYearMonth :@VM: InLink.EndYearMonth


The key to this hash file will be SalesTeam and Rep. The one hash file is both the lookup and the output hash file.

You would need a similar trick to all the unique combinations of Reps. I think you need my month trick on the source to get one record per month. I think your key would be Cust and YearMonth.

Code: Select all

RepGroup = Lookup.RepGroup :",": InLink.RepId


Now you would need to do a UV stage to group by RepGroup to remove duplicates. You would need min(StartYearMonth) and max(EndYearMonth) to get it's date range. If you assign surrogate keys to these then you could lookup SalesTeam. Probably run in just a few minutes.

What do you think? Pretty cool. Will it work?

Ken if this works then you do not need to convert commas to @VM.

The other problem can the customer support this after I leave. Multivalues are very difficult to teach and for newbies to understand. Got to be super fast though. Nice one Chuck.