Problem with wrong surrogate keys

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
alex999
Participant
Posts: 9
Joined: Wed Aug 12, 2015 3:25 am

Problem with wrong surrogate keys

Post by alex999 »

Hi all

I have quite specific problem. I have a client with production DataStage on AIX with DB2. Data warehouse is loaded every night, everything works fine but:
We have noticed that there is a "gap" in numeration of one dimension.
There are ids from 0-xxx and from xxxx-xxxxx. There is no ~4000 id numbers. We have no idea why.

Numbers are generated with NextSurrogateKey() function, without any surrogate_key stage. Key is based on flat file, not on the sequence.

Now DataStage while inserting new ids is filling this gap, so the gap is being reduced every day. I'm not sure also what will happen when the gap in numeration ends, will it overwrite existing numbers?

Do you think it can be repaired somehow?
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

You have no problem. That is the nature of parallel key generation. Each processing node acquires a block of keys to use. When the block is used up, a new block will be acquired. Surrogate keys are only guaranteed to be unique. Gaps are perfectly fine.

Mike
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I do think they have a problem, although I don't understand how they came to be in this predicament. From what I'm reading, while they have a gap in the sequence numbers, DataStage is "filling this gap" which would imply the numbers were set back somehow? If that's true then when the gap is full and numbers start to be reused then the problem will surface because (I assume) you have a unique constraint on that field.

Perhaps when the gap gets to be "almost" full, the key can be reset to just beyond the highest current value?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

DataStage may get it wrong, but it's almost certain that there's a UNIQUE constraint in the database that will prevent surrogate key re-use. Perhaps you can re-initialise your state file from the database.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
alex999
Participant
Posts: 9
Joined: Wed Aug 12, 2015 3:25 am

Post by alex999 »

Thank you for the answers. Nobody knows how they came to be in this predicament... With such numeration they have problems with reports because as numbers get changed somehow, filters on reports get wrong data.

Have you ever noticed how DataStage behave with such gaps? There is an unique constraint of course.
Let's say they have for example 1,2,3,4...,100 and then 1000,1001,1002,...,1999
Datastate is creating 5,6,..,99
and what next? 100 or 2000? If it overwrites 100, i think that this old overwriten record should become a new record with ID 2001...

How to reset the key/ re-initialise state file? As for now i don't even know how the file looks like. They have it on the server where I can't check anything remotely.

I'm afraid of one thing. If I reset it, what with all the foreign keys in fact tables? Will it be necessary to reload everything or datastage can update it somehow?

Here's the screen of the key configuration: http://imgur.com/6WAzzj8
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Create a job that has the Surrogate Key Generator stage downstream of an appropriate database Connector stage.

A Surrogate Key Generator stage with an input link can be used to update the state file. Use an appropriate query to get the new initial value.

Code: Select all

SELECT MAX(SK) + 1 AS NEW_BASE_SK FROM tablename
(A Surrogate Key Generator stage with NO links can be used to create a state file.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

alex999 wrote:Have you ever noticed how DataStage behave with such gaps? There is an unique constraint of course.
Let's say they have for example 1,2,3,4...,100 and then 1000,1001,1002,...,1999
Datastate is creating 5,6,..,99
and what next? 100 or 2000? If it overwrites 100, i think that this old overwriten record should become a new record with ID 2001...
It will give out 100 and the insert will fail because of the unique constraint.
alex999 also wrote:How to reset the key/ re-initialise state file? As for now i don't even know how the file looks like. They have it on the server where I can't check anything remotely.
Ray covered that. What it looks like is discussed in the documentation but doesn't really matter. All you really need to understand is how to use / maintain it.
alex999 lastly wrote:I'm afraid of one thing. If I reset it, what with all the foreign keys in fact tables? Will it be necessary to reload everything or datastage can update it somehow?
Nothing to be afraid of. By 'reset' we mean 'push it past the highest value already used'. Using your earlier example that would mean resetting it to 2000... Ray's "max+1". No FK concerns, no reloading, just uninterrupted production loads. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I've done a pretty extensive amount of testing of the surrogate key generator using flat file state files over the years. The behavior that the OP observed is perfectly normal and most likely not a problem.

As long as all of the key values in the table were generated from the same surrogate key state file, you can be confident that they will always be generated uniquely.

DataStage does a great job of managing multiple processes generating keys from the same state file whether that is due to multiple processing nodes or multiple jobs.

The state file is a simple flat file with 1 or more 16 byte records. Each record consists of two 8-byte unsigned big integers.

I have a generic utility job that I use to sync a surrogate key state file with a database table. This comes in handy when you copy data from one database environment to another, when you need to create an initial state file, or any time that you feel a need to sync up.

Every process that needs to generate keys gets an available block of keys from the state file. The default block size is 1000, but is configurable. Each process will grab a block of keys to use. When its block is used up it goes to get another block. When a process finishes, the unused portion of its block is available for another process to use.

Sorry for the long-winded follow-up.

DataStage's surrogate key generator with the flat file state file is one of my favorite features.

Mike
Post Reply