CRC reliability


Moderators: chulett, rschirm, roy

raja123
Premium Member
Posts: 23
Joined: Sat May 03, 2008 11:40 am

CRC reliability

Post by raja123 »

Hi All

I have history data from the client loaded into a hash file. I need to compare this hash file with the source to get the delta data.

Since the source does not have any date columns, I am concatenating all the columns from the source as well as from the target and generating CRC values for both. My logic is: if the source CRC does not match the hash file CRC, I pass those rows to the target as delta data.

Is this the right approach? Is the CRC approach good enough if I have more than a million rows?

Please help me out with this issue. I have seen that with more than 200k rows, CRC values get duplicated. Is that because of CRC itself, or have I done something wrong in my job?

Thanks and Regards,
Surendra Kumar Sharma
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

CRC32() uses an algorithm to generate a 32-bit integer, repeatable for any given argument. Therefore, for any one comparison, it has approximately one chance in 2^32 (one in 4,294,967,296) of generating a false positive. If that's within your comfort zone, go for it. Make sure there are no NULL values in the argument.
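To illustrate the point about repeatability and NULL handling, here is a minimal sketch in Python rather than DS Basic. It uses zlib.crc32 as a stand-in for CRC32(), and the delimiter, the NULL marker, and the function name row_crc are assumptions made for the example, not anything taken from the job in question.

```python
import zlib

DELIM = "|"               # assumed column separator; pick one that cannot occur in the data
NULL_MARKER = "\x00NULL"  # hypothetical placeholder so NULLs never reach the CRC argument


def row_crc(columns):
    """Return a 32-bit CRC for one row, given its column values in a fixed order."""
    # Replace NULLs (None in Python) with an explicit marker before concatenating.
    parts = [NULL_MARKER if value is None else str(value) for value in columns]
    # The delimiter keeps ("ab", "") and ("a", "b") from producing the same string.
    text = DELIM.join(parts)
    return zlib.crc32(text.encode("utf-8"))


if __name__ == "__main__":
    print(row_crc(("1001", "Smith", None, "Sydney")))
```

The same column order, delimiter, and NULL substitution have to be used on both the source side and the history side, otherwise unchanged rows will appear to differ.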
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Who cares if it generates 'duplicate' values within your data? All you care about is, if any aspect of the data changes, does the new CRC value differ from the old CRC value.

In my mind, Ray's 'false positive' is the chance that the values in a single record would change and still manage to generate the same CRC value. I guess more of a false negative in that case. :wink:
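A rough sketch of both points, in Python rather than DS Basic and assuming CRC values are spread uniformly over the 32-bit range. The first function estimates how many pairs of different rows are expected to share a CRC (the usual birthday estimate), which comes out between four and five pairs at 200,000 rows, so seeing some duplicates at that volume is expected. The second function shows that only the CRC stored for the same key is ever compared. The function names and the dictionary layout are hypothetical.

```python
import zlib


def expected_colliding_pairs(n_rows, bits=32):
    """Birthday estimate: expected number of row pairs that share a CRC value."""
    pairs = n_rows * (n_rows - 1) / 2
    return pairs / 2 ** bits


def changed_keys(source, history_crc):
    """Return keys whose current CRC differs from the CRC stored for that key.

    source:      key -> concatenated column string from the source
    history_crc: key -> CRC stored in the history (hash file) lookup
    Keys missing from history_crc are treated as new and are also returned.
    """
    return [
        key
        for key, text in source.items()
        if history_crc.get(key) != zlib.crc32(text.encode("utf-8"))
    ]


if __name__ == "__main__":
    print(expected_colliding_pairs(200_000))
    print(expected_colliding_pairs(1_000_000))
```

Two different keys sharing a CRC value never enters into the comparison; the only miss is a row whose columns change and still produce the same CRC as its own previous version.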
-craig

"You can never have too many knives" -- Logan Nine Fingers
PBalamurugan
Participant
Posts: 30
Joined: Sun Apr 06, 2008 9:58 pm

Post by PBalamurugan »

I too have a similar requirement for my current project. I am using the Merge stage ("left only" option) to find delta records, merging the current file with the previous day's file. After all processing, the previous day's file is overwritten by the current file. I chose this approach because of the high data volume.

Hope this helps.
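The idea can be sketched in Python as a "left only" comparison, assuming the merge keys cover every column being compared, so that a changed record no longer matches its previous-day version. The function name left_only and the tuple layout are illustrative only and say nothing about how the Merge stage is implemented.

```python
def left_only(current_rows, previous_rows):
    """Return rows present in the current file but absent from the previous day's file.

    Each row is a tuple of column values; a row counts as matched only if
    every column is identical, so new and changed rows both come through.
    """
    previous = set(previous_rows)
    return [row for row in current_rows if row not in previous]


if __name__ == "__main__":
    previous_day = [("1", "Smith", "Sydney"), ("2", "Jones", "Denver")]
    current_day = [("1", "Smith", "Sydney"), ("2", "Jones", "Boulder"), ("3", "Lee", "Perth")]
    for row in left_only(current_day, previous_day):
        print(row)
```

Because whole records are compared, this avoids the CRC collision question entirely, at the cost of keeping the full previous-day file around.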