Surrogate Key generator

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
sriram_h
Participant
Posts: 9
Joined: Mon Apr 11, 2005 4:43 am

Surrogate Key generator

Post by sriram_h »

I would like to Know the various functionalities of Surrogate key generator and when should i choose surrogate Key for an application.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Hello sriram,

the question of surrogate keys is not very specific to DataStage, it lies rather close to the core of Data Warehousing. Every book on DWH will contain either a chapter or at least some reference to surrogate keys, and a quick look at Google threw out this page with a short descrption.

The Px's implementation of surrogate key generation is simple and straightforward, so once you get a feel for where and how to use such an entity you can easily use DS to generate them for you.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Basically the Surrogate Key generates a unique number to each record using a counter. By default the counter starts at 0, but you can specify another number if necessary. For example, if you were generating keys to a table that gets appended daily, you may want to get the max id in the table, and use max id + 1 as the starting value in your surrogate key stage.

We have used surrogate keys in situations where multiple data keys are required to identify a unique record, like customer-to-account relationships (keys = customer, account, relationship-code, effective_date, etc). It is easier and more efficient to join on a single key than many.

In other situations, we are receiving data from a source that may eventually reuse its id's. Say we get account data and the source provides a cust_acct_id as a unique identifier. We use cust_acct_id as our primary key and then the source decides that they are going to start reusing cust_acct_id's that have been purged from their system. That works fine for them, they are an operational system and only keep current data. We, however, are a data warehouse and store data historically. If we keep cust_acct_id as our only key, then we have to track the source's reuse of each id. On the other hand, if we use a generated key and carry cust_acct_id simply as another data field, then we have removed the dependency on the source's cust_acct_id and their reuse.

Hope this helps.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Small pedantic point. Surrogate keys don't need to be a sequence of numbers, they don't even have to be numbers; all they have to be is unique within the table.

It happens that generating a sequence of integers is the easiest way to achieve this, and that integers are maximally efficient for joins. However, if you generate a surrogate key value then end up not using it (maybe because the row is rejected), there's nothing wrong with the valid data in the table. Even though there's a gap in the sequence you still have uniqueness.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

ray.wurlod wrote:Small pedantic point. Surrogate keys don't need to be a sequence of numbers, they don't even have to be numbers; all they have to be is unique within the table.
You're right, but for some reason the Surrogate Key stage in DS (in the Processing group) only uses 16,32, or 64 bit integers. If you want to generate something other than integers, you can use the Generator Stage (in the Development/Debug group).
Adam_Clone
Participant
Posts: 26
Joined: Fri Apr 08, 2005 12:58 am

Re: Surrogate Key generator

Post by Adam_Clone »

hi
Surrogate keys come into use usually in 'Type 2' category of jobs. These type jobs are those used to handle historcal data. For eg, say someone's address changes, and we need to keep the old address also in the data warehouse. But now the problem is that this row with the old data cannot be kept in the warehouse with the same key, and moreover the key itself should be the same for the row with the old address and the new one, which are effectively the same person. So what we do is we keep the row data intact with the same key for both the rows (Same row with new address and old address) and add one more column to the table in the warehouse which acts as the new key for the table. The surrogate key generator object is used to generate this key called the surrogate key. Remember to unclick the 'Key' option for the original key column in the output link from surrogate key generator stage.
I hope this has cleared your doubt.
Post Reply