CRC32 Produces Troubling Results

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Bryceson
Charter Member
Posts: 88
Joined: Wed Aug 03, 2005 1:11 pm
Location: Madison, WI

CRC32 Produces Troubling Results

Post by Bryceson »

Hi Everyone,

We are at the beginning of the project I am working on, testing CRC32 as we establish our ETL DataStage strategy; we want to use CRC32 in the development of Server jobs to detect changes.

I have searched the Forum on the CRC32 topic and there was plenty of information and commentary on how it can be used.

I tried using the CRC32 function for an initial load of 100,000 rows with seven columns (trimmed and checked for nulls), and it produced the same value for two different records. With 200K rows I get 10 such rows, paired two by two with the same value.

But if I load fewer than 100,000 rows (about 99,000), it produces unique values for every single row. My goal here is to run through at least 20 million records.

I understand there is roughly a 1 in 4 billion chance of CRC32 producing the same value for two rows, but I have not even passed the 100K mark and I already get a dupe value! :?

Is there a specific way to use this function??

Does anybody know of a free utility out there that will aid in the change data capture process??

Any thoughts and ideas are appreciated.

Thanks . . . . Bryceson
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm confused. Why would you care if 2 out of 100,000 records had the same value in the seven data fields you are concatenating and CRC32'ing unless those records had the same natural key? Is that what you are saying? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
Bryceson
Charter Member
Posts: 88
Joined: Wed Aug 03, 2005 1:11 pm
Location: Madison, WI

Post by Bryceson »

Craig,

No, the records do not have the same natural key! Isn't CRC32 supposed to create unique values for that matter??

Thanks . . . Bryceson
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No. You have to understand the (and I hate to use this word) 'limitations' of a 32-bit CRC value, plus how to properly leverage it to detect change in a Server job. The way we've done it, which works just fine, is to compute the CRC32 value for the fields you are tracking changes against per natural key, then store that CRC32 value as the single data element in a hashed file keyed by the natural or mapped surrogate key, depending on where you are in your jobstream. Your check is then also per key: did any of the values of these data fields change, on a key-by-key basis? That would be the question I would think you are trying to answer.
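That scheme can be sketched outside DataStage. Here is a minimal Python stand-in, assuming a dict in place of the hashed file; the key and field values are made up purely for illustration:

```python
import zlib

def row_crc(fields):
    """CRC32 over the concatenated tracked fields (trimmed, NULLs mapped to '')."""
    joined = "|".join("" if f is None else str(f).strip() for f in fields)
    return zlib.crc32(joined.encode("utf-8")) & 0xFFFFFFFF

# Stand-in for the hashed file: natural key -> stored CRC32 of the tracked fields.
crc_store = {"CUST-001": row_crc(["Alice", "Madison", "WI"])}

# Per-key check: did any tracked field change for this natural key?
incoming_key, incoming_fields = "CUST-001", ["Alice", "Milwaukee", "WI"]
changed = crc_store.get(incoming_key) != row_crc(incoming_fields)
```

The `|` delimiter guards against two rows concatenating to the same string (e.g. `"ab" + "c"` vs `"a" + "bc"`), which would otherwise be a guaranteed false match regardless of the CRC.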

Don't expect to generate values for all of your records and get a unique value for each and every one of them; that's not going to happen, nor does it need to. If you've read the postings here on CRC32, especially those by Michael Hester, who wrote the utility for Ascential, you should have seen that. To me, this is not much different from people having 'issues' when they try to use CRC32 to generate unique surrogate keys - it's not an appropriate use.

Perhaps it would be helpful if you fully explained how you were planning on leveraging the CRC32 values you've been generating?
-craig

"You can never have too many knives" -- Logan Nine Fingers
Bryceson
Charter Member
Posts: 88
Joined: Wed Aug 03, 2005 1:11 pm
Location: Madison, WI

Post by Bryceson »

Craig,

I want to use the CRC32 values to identify whether an incoming row has changed or is a new record.

1. For new records - compare the new CRC32 value with the CRC32 value stored in a hashed file; if not the same, it's an insert.

2. For changed records - check the hashed file for the existence of the natural key AND compare the new CRC32 value with the stored CRC32 value; if not the same, it's an update.

3. Update the hashed file - compare the new CRC32 value with the stored CRC32 value, and update it if not the same.
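The three rules can be sketched in Python, again with a dict standing in for the hashed file; the function and variable names are illustrative, not DataStage constructs:

```python
def classify(natural_key, crc, crc_store):
    """Return 'insert', 'update', or 'unchanged' for one incoming row."""
    stored = crc_store.get(natural_key)
    if stored is None:
        action = "insert"        # natural key not in hashed file: new record
    elif stored != crc:
        action = "update"        # key found but the tracked fields changed
    else:
        action = "unchanged"     # key found, same CRC: nothing to do
    if action != "unchanged":
        crc_store[natural_key] = crc   # keep the hashed file current
    return action
```

Note that the "insert" case falls out of the key lookup alone; the CRC comparison only matters once the key is found.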

Any thoughts on this approach??

Thanks . . . Bryceson
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Basically, that all sounds fine to me.

1. A new record wouldn't get a hit on the hashed file, so it seems like a simple NOTFOUND link status would tell you a record is new.

2. Yup.

3. For any inserts or updates you'd want to update the hashed file with the new CRC32 value so you don't have to do it again later. If a natural key could be seen more than once in the load, then you would need to do this - all the normal caveats apply in that case: don't cache or cache and lock for updates. I prefer the former.

I prefer to write 'new hashed file keys' to a flat file and then read from the flat file and update the hashed file once the database work is finished in the job. Requires a transaction size of 0 for everything to work out properly in the event of a failure, but that's normal for us. What I'm trying to avoid are situations where db work rolls back but hashed file changes do not. Obviously, the other approach would be to enable the 'backup' option for the hashed file, then it can be 'rolled back' when a job is reset.
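The deferred-update idea can be sketched the same way, assuming a list in place of the flat file of new keys and a dict in place of the hashed file; the names are made up for illustration:

```python
pending = []  # stand-in for the flat file of staged (key, crc) rows

def stage(key, crc):
    """Record a hashed-file change without applying it yet."""
    pending.append((key, crc))

def apply_after_commit(crc_store, db_commit_ok):
    """Apply staged changes only once the database work has committed."""
    if not db_commit_ok:
        pending.clear()          # DB rolled back: discard staged changes too
        return 0
    for key, crc in pending:
        crc_store[key] = crc
    applied = len(pending)
    pending.clear()
    return applied
```

This mirrors the point of the flat-file pattern: the reference store is never ahead of the database, so a failed run cannot leave rows looking "already processed".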

In any case, you'd need something to 'seed' this hashed file and do a CRC32 calc on all existing records.

Hope that all helps...
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

And before Ray wakes up and whacks us:
:D
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

CRC32 is like a hashing algorithm that generates a 32-bit unsigned integer based on the byte values in the supplied character string. It therefore carries with it the probability that the same value might be generated from two different input values - 1 in 2^32 (approximately 1 in four billion) for any given pair. Across a large set of rows, though, the chance of at least one collision grows much faster than that per-pair figure (the "birthday paradox"), which is why duplicates can turn up well before the 100K mark.

CRC64 also carries a finite (but small) probability of false positive, in this case one in 18,446,744,073,709,551,616.
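The per-pair 1-in-2^32 figure can be turned into an expected collision count for a whole load: among n uniformly distributed 32-bit hashes there are n(n-1)/2 pairs, so the expected number of colliding pairs is roughly n(n-1)/2 divided by 2^32. A quick check (the function name is mine, not from the thread) matches the observed behaviour:

```python
def expected_collisions(n, space=2**32):
    """Expected number of colliding pairs among n uniform hashes (birthday bound)."""
    return n * (n - 1) / 2 / space

print(round(expected_collisions(100_000), 2))  # 1.16 - one dupe pair at 100K is expected
print(round(expected_collisions(200_000), 2))  # 4.66 - about 5 pairs, i.e. ~10 rows, at 200K
```

This lines up with the original report: one duplicate pair around 100K rows, and roughly 10 affected rows at 200K. At 20 million rows the bound predicts tens of thousands of colliding pairs, which is why the comparison must be per natural key rather than across the whole file.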

No, I don't have a CRC64 routine in my back pocket.
:wink:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bryceson
Charter Member
Posts: 88
Joined: Wed Aug 03, 2005 1:11 pm
Location: Madison, WI

Post by Bryceson »

Thanks Craig!

1. Yes, I think NOTFOUND will do for new records. Got it!

3. I meant to say insert or update (continue to update the hashed file).


Ray,

Seriously, where would be an ideal place to get CRC64?? Is this something IBM is working on?? I did not even think CRC64 existed out there . . .

Thanks . . . . Bryceson
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The mathematics of CRC are interesting and well-defined, plus you can find C code implementations of CRC for 32 and 64 bit in the public domain. Start on this Wikipedia page
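For reference, the common 32-bit variant those public-domain C implementations compute can be written in a few lines. This is a Python sketch of the bit-by-bit algorithm (reflected form, polynomial 0xEDB88320), offered as illustration rather than as DataStage code:

```python
def crc32_ref(data: bytes) -> int:
    """Bitwise CRC-32 (the common reflected variant, poly 0xEDB88320)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

# Standard check value for this CRC variant:
assert crc32_ref(b"123456789") == 0xCBF43926
```

Production implementations replace the inner 8-iteration loop with a 256-entry lookup table, but the result is bit-for-bit the same.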