CRC32 Does have limitations (severe in my opinion)
Moderators: chulett, rschirm, roy
CRC32 Does have limitations (severe in my opinion)
Per DataStage customer support:javascript:
"The CRC32 function in DataStage does support character strings above 32 bits. There is a limitation on checksums to determine duplicate strings and it can return duplicate values."
I have several tables which do not have any keys assigned. I built a crc32 derivation using pipes between the columns.
Sample data:
1|2005-09-09 15:43:09.820|19089772|CDS_REDEMPTION|1210622|+|1
1|2006-01-22 02:53:52.147|23515515|CDS_ACCOUNT|144629303|+|1
The resulting CRC32 value for both input is -1614948330.
To date I have tried doing a CRC32 on each value and then a CRC32 on that but I am still getting duplicates.
With the size of tables today I am simply amazed that this has not come up before.
Any suggestions would be appreciated.
leslie
"The CRC32 function in DataStage does support character strings above 32 bits. There is a limitation on checksums to determine duplicate strings and it can return duplicate values."
I have several tables which do not have any keys assigned. I built a crc32 derivation using pipes between the columns.
Sample data:
1|2005-09-09 15:43:09.820|19089772|CDS_REDEMPTION|1210622|+|1
1|2006-01-22 02:53:52.147|23515515|CDS_ACCOUNT|144629303|+|1
The resulting CRC32 value for both input is -1614948330.
To date I have tried doing a CRC32 on each value and then a CRC32 on that but I am still getting duplicates.
With the size of tables today I am simply amazed that this has not come up before.
Any suggestions would be appreciated.
leslie
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
Hi Leslie,
I disagree with you on that. I just tested the CRC32 routine for the two records in your post and I'm getting two unique values.
For the first record I get :
For the second record I get:
I would like to know how exactly you are using the CRC32 routine. That would be helpful. But CRC32 does work good.
Thanks,
Naveen.
I disagree with you on that. I just tested the CRC32 routine for the two records in your post and I'm getting two unique values.
For the first record I get :
Code: Select all
TEST #1
*******
Arg1 = 1
Arg2 = 2005-09-09 15:43:09.820
Arg3 = 19089772
Arg4 = CDS_REDEMPTION
Arg5 = 1210622
Arg6 = +
Arg7 = 1
Test completed.
Result = -527945218
Code: Select all
TEST #2
*******
Arg1 = 1
Arg2 = 2006-01-22 02:53:52.147
Arg3 = 23515515
Arg4 = CDS_ACCOUNT
Arg5 = 144629303
Arg6 = +
Arg7 = 1
Test completed.
Result = -323520248
Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Hi Naveen,
I think Leslie's observation is correct, since I am also getting the same results as she got it.
I used a transformer and a output link to a sequential file.
The sequential file had two columns
the Derivation to both of the columns was
The result was the same:
-1614948330
I think Leslie's observation is correct, since I am also getting the same results as she got it.
I used a transformer and a output link to a sequential file.
The sequential file had two columns
the Derivation to both of the columns was
Code: Select all
COL1 : CRC32('1|2005-09-09 15:43:09.820|19089772|CDS_REDEMPTION|1210622|+|1')
Code: Select all
COL2 :CRC32('1|2006-01-22 02:53:52.147|23515515|CDS_ACCOUNT|144629303|+|1')
-1614948330
Success consists of getting up just one more time than you fall.
Leslie,
Although I'm not at a DataStage client right now and can't confirm the CRC32 collision, I must admit that I'm surprised that two such close strings will have an identical CRC32 return value.
My faith in the CRC32 method would be greatly diminished if you were to find a second pair of strings with the same CRC32 or a third string that matches your output, albeit my first thought would be that the implementation of the algorithm was corrupted in DataStage.
The CRC32 algorithm is not too complex to understand and is, in my opinion, an elegant computational masterpiece. It uses an internal lookup table with binary values seeded from a constant value and will actually use every byte in the original string to compute the return value.
The resulting number has 4,294,967,296 possible values and the algorithm is designed so that strings with just one letter difference will yield very different results.
Your CRC32 collision is a one-in-4.2Billion chance (although I think the actual odds are less, since the string is short). Any algorithm that is "lossy" (i.e. reduces the amount of information stored) will yield a duplicate sometime; the better the algorithm, the less the chance of that happening.
Although I'm not at a DataStage client right now and can't confirm the CRC32 collision, I must admit that I'm surprised that two such close strings will have an identical CRC32 return value.
My faith in the CRC32 method would be greatly diminished if you were to find a second pair of strings with the same CRC32 or a third string that matches your output, albeit my first thought would be that the implementation of the algorithm was corrupted in DataStage.
The CRC32 algorithm is not too complex to understand and is, in my opinion, an elegant computational masterpiece. It uses an internal lookup table with binary values seeded from a constant value and will actually use every byte in the original string to compute the return value.
The resulting number has 4,294,967,296 possible values and the algorithm is designed so that strings with just one letter difference will yield very different results.
Your CRC32 collision is a one-in-4.2Billion chance (although I think the actual odds are less, since the string is short). Any algorithm that is "lossy" (i.e. reduces the amount of information stored) will yield a duplicate sometime; the better the algorithm, the less the chance of that happening.
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
Yes! I just realized that I had tested the CRC32 routine with the above test cases this :
in a routine and it returned me:
But when tested it with the pipes(the whole original line posted by Leslie), which I omitted above thinking that the de-limiter should be omitted. Then for this code, I get:
Which proves that Leslie is right. And I'm very curious to know why is this happening, apart from the disappointment that CRC32 failed. It is one of my favourites
Is this really a case of the odds of CRC32 going wrong coming right?Gurus....Please shed some light on it.
Many Thanks,
Naveen.
Code: Select all
Ans = CRC32('1':'2006-01-22 02:53:52.147':'23515515':'CDS_ACCOUNT':'144629303':'+':'1'): " ****** " : CRC32('1':'2005-09-09 15:43:09.820':'19089772':'CDS_REDEMPTION':'1210622':'+':'1')
Code: Select all
TEST #1
*******
Arg1 =
Test completed.
Result = -323520248 ****** -527945218
Code: Select all
Ans = CRC32('1|2006-01-22 02:53:52.147|23515515|CDS_ACCOUNT|144629303|+|1'): " ****** " : CRC32('1|2005-09-09 15:43:09.820|19089772|CDS_REDEMPTION|1210622|+|1')
Code: Select all
TEST #1
*******
Arg1 =
Test completed.
Result = -1614948330 ****** -1614948330
Is this really a case of the odds of CRC32 going wrong coming right?Gurus....Please shed some light on it.
Many Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
I have a 6 million row input which results in over 2000 CRC32 duplicates. What you will find interesting is that if you start removing data from my supplied examples you will finally get unique answers...more is not better. CRC32 has been around forever. I find it hard to believe that the DS community has not run into this before. I have tried wrapping CRC32 around each column and then doing a CRC32 of the result but the number of duplicates increases dramatically.
Does anyone feel like writing an updated CRC32...say a CRC64 or one using the MD5 hashing algorithm. I simply don't have the skill to do so.
This is put a real stop on my development at this point.
Does anyone feel like writing an updated CRC32...say a CRC64 or one using the MD5 hashing algorithm. I simply don't have the skill to do so.
This is put a real stop on my development at this point.
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
Leslie,
I tried to use a the 'comma' de-limiter rather than the 'pipe' de-limiter.
Here are the results (They are unique):
Is '|' the culprit?
Try running your job by substituting "," instead of "|".
I'm thinking there would be no duplicates this time.
Thanks,
Naveen.
I tried to use a the 'comma' de-limiter rather than the 'pipe' de-limiter.
Here are the results (They are unique):
Code: Select all
Ans = 'COL1': '=':CRC32('1,2006-01-22 02:53:52.147,23515515,CDS_ACCOUNT,144629303,+,1'):' , ':'COL2': '=' :CRC32('1,2005-09-09 15:43:09.820,19089772,CDS_REDEMPTION,1210622,+,1')
Code: Select all
TEST #1
*******
Arg1 =
Test completed.
Result: COL1 = -963251981 , COL2 = -1697917724
Try running your job by substituting "," instead of "|".
I'm thinking there would be no duplicates this time.
Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
Hi Leslie,
Can you try this last one :
instead of
Remove the :'s from the timestamp. But I feel this one is a long shot too. Can you e-mail me the 2000 duplicates that your are getting?
My e-mail address is : ndronavalli@gmail.com, ndronavalli@amre.com
Many Thanks,
Naveen.
Can you try this last one :
Code: Select all
1,2005-08-19 131710.883,18527139,CDS_REDEMPTION,1179677,+,1
Code: Select all
1,2005-08-19 13:17:10.883,18527139,CDS_REDEMPTION,1179677,+,1
My e-mail address is : ndronavalli@gmail.com, ndronavalli@amre.com
Many Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
Usually, when somebody uses CRC32 they concatenate the fields and then pass them to the CRC32 routine. I strongly suggest that you try that method and also as suggested by Craig.
For example:
Thanks,
Naveen.
For example:
Code: Select all
CRC32('1':'2006-01-22 02:53:52.147':'23515515':'CDS_ACCOUNT':'144629303':'+':'1')
Code: Select all
CRC32(col1 : col2 : col3 : col4 : col5 : col6 : col7)
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE