CRC32 Does have limitations (severe in my opinion)

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

lclapp
Premium Member
Premium Member
Posts: 21
Joined: Wed May 19, 2004 2:43 pm

CRC32 Does have limitations (severe in my opinion)

Post by lclapp »

Per DataStage customer support:javascript:

"The CRC32 function in DataStage does support character strings above 32 bits. There is a limitation on checksums to determine duplicate strings and it can return duplicate values."

I have several tables which do not have any keys assigned. I built a crc32 derivation using pipes between the columns.

Sample data:

1|2005-09-09 15:43:09.820|19089772|CDS_REDEMPTION|1210622|+|1

1|2006-01-22 02:53:52.147|23515515|CDS_ACCOUNT|144629303|+|1

The resulting CRC32 value for both input is -1614948330.

To date I have tried doing a CRC32 on each value and then a CRC32 on that but I am still getting duplicates.

With the size of tables today I am simply amazed that this has not come up before.

Any suggestions would be appreciated.

leslie
I_Server_Whale
Premium Member
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Hi Leslie,

I disagree with you on that. I just tested the CRC32 routine for the two records in your post and I'm getting two unique values.

For the first record I get :

Code: Select all


TEST #1
*******
 
Arg1 = 1
Arg2 = 2005-09-09 15:43:09.820
Arg3 = 19089772
Arg4 = CDS_REDEMPTION
Arg5 = 1210622
Arg6 = +
Arg7 = 1
 
Test completed.
 

Result = -527945218

For the second record I get:

Code: Select all


TEST #2
*******
 
Arg1 = 1
Arg2 = 2006-01-22 02:53:52.147
Arg3 = 23515515
Arg4 = CDS_ACCOUNT
Arg5 = 144629303
Arg6 = +
Arg7 = 1
 
Test completed.
 

Result = -323520248

I would like to know how exactly you are using the CRC32 routine. That would be helpful. But CRC32 does work good.

Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Post by loveojha2 »

Hi Naveen,

I think Leslie's observation is correct, since I am also getting the same results as she got it.

I used a transformer and a output link to a sequential file.
The sequential file had two columns
the Derivation to both of the columns was

Code: Select all

COL1 : CRC32('1|2005-09-09 15:43:09.820|19089772|CDS_REDEMPTION|1210622|+|1')

Code: Select all

COL2 :CRC32('1|2006-01-22 02:53:52.147|23515515|CDS_ACCOUNT|144629303|+|1')
The result was the same:
-1614948330
Success consists of getting up just one more time than you fall.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Leslie,

Although I'm not at a DataStage client right now and can't confirm the CRC32 collision, I must admit that I'm surprised that two such close strings will have an identical CRC32 return value.

My faith in the CRC32 method would be greatly diminished if you were to find a second pair of strings with the same CRC32 or a third string that matches your output, albeit my first thought would be that the implementation of the algorithm was corrupted in DataStage.

The CRC32 algorithm is not too complex to understand and is, in my opinion, an elegant computational masterpiece. It uses an internal lookup table with binary values seeded from a constant value and will actually use every byte in the original string to compute the return value.

The resulting number has 4,294,967,296 possible values and the algorithm is designed so that strings with just one letter difference will yield very different results.

Your CRC32 collision is a one-in-4.2Billion chance (although I think the actual odds are less, since the string is short). Any algorithm that is "lossy" (i.e. reduces the amount of information stored) will yield a duplicate sometime; the better the algorithm, the less the chance of that happening.
I_Server_Whale
Premium Member
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Yes! I just realized that I had tested the CRC32 routine with the above test cases this :

Code: Select all


      Ans = CRC32('1':'2006-01-22 02:53:52.147':'23515515':'CDS_ACCOUNT':'144629303':'+':'1'): " ****** " : CRC32('1':'2005-09-09 15:43:09.820':'19089772':'CDS_REDEMPTION':'1210622':'+':'1')

in a routine and it returned me:

Code: Select all


TEST #1
*******
 
Arg1 = 
 
Test completed.
 

Result = -323520248 ****** -527945218

But when tested it with the pipes(the whole original line posted by Leslie), which I omitted above thinking that the de-limiter should be omitted. Then for this code, I get:

Code: Select all


Ans = CRC32('1|2006-01-22 02:53:52.147|23515515|CDS_ACCOUNT|144629303|+|1'): " ****** " : CRC32('1|2005-09-09 15:43:09.820|19089772|CDS_REDEMPTION|1210622|+|1')

Code: Select all


TEST #1
*******
 
Arg1 = 
 
Test completed.
 

Result = -1614948330 ****** -1614948330

Which proves that Leslie is right. And I'm very curious to know why is this happening, apart from the disappointment that CRC32 failed. It is one of my favourites :(

Is this really a case of the odds of CRC32 going wrong coming right?Gurus....Please shed some light :idea: on it.

Many Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Bryceson
Charter Member
Charter Member
Posts: 88
Joined: Wed Aug 03, 2005 1:11 pm
Location: Madison, WI

Post by Bryceson »

Folks,

I do not know much about CheckSum or CRC32 function. What is the difference between the two??. When should we use them or one vs the other??

Thanks in advance for your knowledge!

Bryceson
lclapp
Premium Member
Premium Member
Posts: 21
Joined: Wed May 19, 2004 2:43 pm

Post by lclapp »

I have a 6 million row input which results in over 2000 CRC32 duplicates. What you will find interesting is that if you start removing data from my supplied examples you will finally get unique answers...more is not better. CRC32 has been around forever. I find it hard to believe that the DS community has not run into this before. I have tried wrapping CRC32 around each column and then doing a CRC32 of the result but the number of duplicates increases dramatically.

Does anyone feel like writing an updated CRC32...say a CRC64 or one using the MD5 hashing algorithm. I simply don't have the skill to do so.

This is put a real stop on my development at this point.
I_Server_Whale
Premium Member
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Leslie,

I tried to use a the 'comma' de-limiter rather than the 'pipe' de-limiter.

Here are the results (They are unique):


Code: Select all


Ans = 'COL1': '=':CRC32('1,2006-01-22 02:53:52.147,23515515,CDS_ACCOUNT,144629303,+,1'):' , ':'COL2': '=' :CRC32('1,2005-09-09 15:43:09.820,19089772,CDS_REDEMPTION,1210622,+,1')

Code: Select all


TEST #1
*******
 
Arg1 = 
 
Test completed.
 

Result: COL1 = -963251981 , COL2 = -1697917724

Is '|' the culprit?
Try running your job by substituting "," instead of "|".
I'm thinking there would be no duplicates this time.

Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
lclapp
Premium Member
Premium Member
Posts: 21
Joined: Wed May 19, 2004 2:43 pm

Post by lclapp »

1,2005-08-19 13:17:10.883,18527139,CDS_REDEMPTION,1179677,+,1
1,2005-08-29 10:48:50.470,18776558,CDS_PLAYER,144612819,+,1

Both results in CRC value of 160456061

So the '|' is not the problem.

leslie( BTY just for those who don't know I am MWM better looking than Dennis or Mr. Ray or Mr. Duke.....)
I_Server_Whale
Premium Member
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

:cry: That is disastrous.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
I_Server_Whale
Premium Member
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Hi Leslie,

Can you try this last one :? :

Code: Select all


1,2005-08-19 131710.883,18527139,CDS_REDEMPTION,1179677,+,1

instead of

Code: Select all


1,2005-08-19 13:17:10.883,18527139,CDS_REDEMPTION,1179677,+,1

Remove the :'s from the timestamp. But I feel this one is a long shot too. Can you e-mail me the 2000 duplicates that your are getting?

My e-mail address is : ndronavalli@gmail.com, ndronavalli@amre.com

Many Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Actually, try it without the commas between the fields. Don't understand exactly why, but with commas = same checksum while without commas = totally different checksums.
-craig

"You can never have too many knives" -- Logan Nine Fingers
lclapp
Premium Member
Premium Member
Posts: 21
Joined: Wed May 19, 2004 2:43 pm

Post by lclapp »

If you don't put a some qualifier then you run the risk of getting the same CRC32 results(!!???) for different rows...ie 123 in col1 and 4 in col2 (123,4) should be different than 12 in col1 and 34 in col2 (12,34).
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Pad them (each field) to a consistant size.
-craig

"You can never have too many knives" -- Logan Nine Fingers
I_Server_Whale
Premium Member
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Usually, when somebody uses CRC32 they concatenate the fields and then pass them to the CRC32 routine. I strongly suggest that you try that method and also as suggested by Craig.

For example:

Code: Select all


CRC32('1':'2006-01-22 02:53:52.147':'23515515':'CDS_ACCOUNT':'144629303':'+':'1')

Code: Select all


CRC32(col1 : col2 : col3 : col4 : col5 : col6 : col7)

Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Post Reply