DS Best Practices CDD question

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Billy
Participant
Posts: 9
Joined: Fri Aug 22, 2003 3:53 am
Location: Malaysia

DS Best Practices CDD question

Post by Billy »

Hi All,

I've been going through the course materials and I noticed that the CDD (Change Data Detection) method requires extracting the entire table into bkt, src and hash files.

Based on all your vast and many years of experience, is this the best CDD method for a DBMS without a timestamp feature? Wouldn't extracting the entire table itself require a large amount of time?

Another thing I noticed was that there was no CDD for a DBMS with a timestamp feature. Usually a Dim or Fact table would require many JOINs and UNIONs, which may extract new/updated records combined with old ones. What is the practised and recommended CDD for this type of scenario?


cheers
Billy
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

For those who don't know the frame of reference for this question, it's regarding the DS306 course offered by Ascential.


I've been going through the course materials and I noticed that the CDD
(Change Data Detection) method requires extracting the entire table into
bkt, src and hash files


I think you misread the documentation, or it's changed since I wrote it for Ascential.

There are basically three ways to determine if a row of data has changed and requires loading into your warehouse:

1. Fully transform it and see if it's different from what's in the warehouse. This is metaphysical certitude: the row has changed because it doesn't match what's in the warehouse.
2. The source system identifies a row as having been inserted or updated (LAST_UPDATE_DT column, trigger capture table, transaction log, etc.). This works sometimes, as long as that source table is all that is required to transform the row to the target; i.e., a join or reference to another table may contain a change, and you won't see that the fully transformed row is different.
3. There is no method from #2 available to "tell" you a row has changed. The only option left is to "dump" the table to a file and compare it to a previous "dump" to see what is different. This can often be accomplished via an ordered dump to a file and a unix "diff todaysfile yesterdaysfile > deltasfile" type statement, in addition to archiving each day's dump file (a minimal sketch of this approach follows the list). This basically gets you back into an incremental update process like #1.
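
As a rough illustration of Method #3, here is a minimal sketch in Python standing in for the ordered dump plus unix diff approach. The file names, the pipe delimiter and the assumption that the first field is the key are purely illustrative, and this simple version finds inserts and updates but not deletes.

# Method #3 sketch: compare today's ordered dump against yesterday's
# to find new or changed rows. File names, delimiter and key position
# are illustrative assumptions only.
def load_dump(path):
    """Read an ordered dump file into {key: full_row} form."""
    rows = {}
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            key = line.split("|", 1)[0]   # assumed: first field is the key
            rows[key] = line
    return rows

today = load_dump("todaysfile")
yesterday = load_dump("yesterdaysfile")

with open("deltasfile", "w") as out:
    for key, row in today.items():
        if yesterday.get(key) != row:     # new key, or same key with new content
            out.write(row + "\n")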

So, the choices are few, and you have to figure out what's available to you and what's practical. Method #1 is the most certain, but you can't extract all of your data every day. Method #2 is great, if you have last-update-date type information to use. Method #3 is great for smaller tables, a handful of millions of rows, but it is not a scaling solution.


Based on all your vast and many years of experience, is this the best
CDD method for a DBMS without a timestamp feature? Wouldn't extracting
the entire table itself require a large amount of time?


My recommendation is to use Method #2 where possible; failing that, beg for some means to identify the data as changed. Otherwise, use Method #1 where your volume is reasonable. The problem there is that it is computationally intensive; in some cases it may be quicker to use Method #3.


Another thing I noticed was that there was no CDD for a DBMS with a
timestamp feature. Usually a Dim or Fact table would require many JOINs
and UNIONs, which may extract new/updated records combined with old
ones. What is the practised and recommended CDD for this type of
scenario?


This is Method #1 that I listed above. Even though a row didn't change in the driving table, a reference table has a change; only once you have fully transformed the row do you actually realize the change. Unfortunate, but true. Most often you are not going back to facts and reprocessing them because reference tables have changed over time (that's called slowly changing dimensions, and you shouldn't be restating history).

With regard to dimensions, I have to do this all the time for clients. I learned my lesson well the first time I put an incremental select on last_update_dt in a source table: a reference join was introducing changes, but I was missing them because my driving-table select didn't pick up the deltas. I had to pull the full table every time, fully transform each row and then compare it with the warehouse row to see if it was different. This is intensive and time costly, but job instantiation and 18 CPUs made it work.
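
To make Method #1 concrete, here is a minimal sketch in Python of the full-transform-and-compare idea. The transform_row() placeholder, the key name and the row layout are illustrative assumptions, not how any particular DataStage job is built.

def detect_changes(source_rows, warehouse_rows, key="business_key"):
    """Yield ('insert' or 'update', row) for fully transformed rows that
    are new or differ from what is already in the warehouse."""
    current = {r[key]: r for r in warehouse_rows}
    for src in source_rows:
        candidate = transform_row(src)      # full transformation, reference joins included
        existing = current.get(candidate[key])
        if existing is None:
            yield "insert", candidate
        elif candidate != existing:         # any column difference means a change
            yield "update", candidate

def transform_row(src):
    # Placeholder for the real join/derivation logic
    return dict(src)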

Good luck!

Kenneth Bland
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Billy,

What Ken describes is true, with #2 being the best solution if available. If that is not the case, then Method #1 is probably what you will have to use, and with the introduction of CRC32 it is much less painful than comparing field against field or column against column. CRC32 was designed for this very purpose and is based on the same CRC algorithm used in HTTP, FTP, etc. What this means to you is that there is roughly a 1 in 4 billion chance that the CRC value generated is incorrect or a duplicate; if you understand statistics, you will understand that processing 4 billion rows does not mean you will have a failure, rather each row has a 1 in 4 billion chance.


CRC32 is very efficient, and the bigger or wider the row the better. If you pass it a string (the entire row), a signed 32-bit number is returned and you can use this to test whether a row has changed. This is much more secure or safe than a checksum or UNIX diff or any of these other methods, since they do not reliably detect small changes in large sources of data.
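
To show the idea outside the tool, here is a minimal Python sketch using the standard zlib.crc32; the column list and separator are assumptions, and note that zlib returns an unsigned value, whereas the DS function is described above as returning a signed 32-bit number.

import zlib

def row_crc(row, columns, sep="\x1f"):
    """Concatenate the chosen columns and return the CRC-32 of the result.
    Use a separator that cannot occur in the data so column boundaries
    stay unambiguous."""
    payload = sep.join(str(row[c]) for c in columns)
    return zlib.crc32(payload.encode("utf-8"))   # unsigned here; DS CRC32 is signed

# Usage: compare against the CRC stored from the previous load
# (e.g. in a hashed file keyed on the business key).
new_crc = row_crc({"cust_id": 42, "name": "Smith", "city": "Lutz"},
                  ["cust_id", "name", "city"])
old_crc = 0          # illustrative; in practice looked up from the previous run
row_changed = (new_crc != old_crc)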

I would be more than happy to send you all of the white papers regarding this method of CDD; if you are interested in the math, you will find them interesting. I also have a sample DSX that incorporates its use, if you would like that as well.

Regards,

Michael Hester
Billy
Participant
Posts: 9
Joined: Fri Aug 22, 2003 3:53 am
Location: Malaysia

Post by Billy »

Hi Michael,

I definitely welcome any material regarding CDD and any tips and tricks relating to ETL. But let's start with the lighter material; I'm fairly new to this area [:)]


cheers
Billy
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It would be really useful if these white papers were on the Ascential web site.

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Ray,

When I originally submitted the CRC32 C program and white paper to Ascential it was my understanding that this would be the case. I have never searched the website to find out if they did post it. I will contact Ascential and check to see if they intend to post this information.

Is there a place on this website or forum to post material like this? It is not proprietary, since the algorithm and code are published and not owned by Ascential.

Regards,

Michael Hester
DCJ
Participant
Posts: 12
Joined: Thu Jan 10, 2002 1:21 pm
Location: USA
Contact:

Post by DCJ »

Mike and Ray,

You may or may not be aware, but at the moment we are awaiting word from Ascential on the legal status of posting white papers in general. Since it could be "illegal" to post this material without permission, I would appreciate it if we marked this post in a favorites section and checked back in a short while. We hope to have this resolved soon with good news for everyone. Thanks.

Dennis
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

DCJ,

There is no "legal" position regarding these white papers. Ascential did not develop the algorithm or the white papers that are being referenced here. The papers, code, and process are public domain and can be distributed without fear of legal action. The algorithm for CRC32 is public domain and is freely available.

I agree that if Ascential develops a white paper regarding CRC32 and its implementation within DS, then it should not be posted until all legal issues are satisfied; however, the material I referenced in this post was written by mathematicians outside the realm of ETL and therefore need not, and should not, be censored by this forum.

Regards,

Michael Hester
dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Post by dickfong »

I have some questions regarding the crc32() function.

1. How is it different from the DS function checksum() and UNIX function cksum?
2. What is the largest possible number that crc32() can generate?

Thank you
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

CRC and checksum are totally different animals. Checksum within UV only generates a 16-bit unsigned integer, and I don't know what the Unix cksum generates, but I would be willing to guess that it is either 16 or 32 bits. (I believe there may even be a 32-bit checksum within UV, although I'm not sure.)

The sequential addition of ASCII values and the modulo arithmetic used in calculating a checksum give it a reliability of approximately 99.6 percent, while the generator polynomial for CRC-32 is:

X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8 + X^7 + X^5 + X^4 + X^2 + X + 1

This affords a near 100% error detection rate.

Example:

If you have two files or pieces of data and generate a CRC value for both, and those values are different, then you can be 100% certain that the data is different. If the values are the same, then you can be 99.9999% sure that the files are the same; the small remaining chance exists because two different inputs can occasionally produce the same CRC, and to rule that out you would have to iterate byte by byte through the data to verify.

Some version of CRC (CRC-12, CRC-16, CRC-CCITT or CRC-32) is used in most communications protocols such as FTP, HTTP, HTTPS, etc.

The value returned by CRC32 is a 32-bit signed integer, and if you do the math that equates to roughly a 1 in 4 billion (2^32 = 4,294,967,296) chance of an incorrect or spurious CRC value being generated, while a 16-bit checksum would be about 1 in 65,000 (2^16 = 65,536).

Think of a CRC value as a "digital signature" of the data being checked. You can generate a CRC value against a single byte of data, a block of data (group of columns), a file or a memory location.

The algorithm for CRC is not additive as in the checksum, but rather based on coefficients and polynomials.
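
To make the "coefficients and polynomials" point concrete, here is a textbook bit-at-a-time CRC-32 in Python using the reflected form (0xEDB88320) of the polynomial quoted above, checked against the widely used zlib implementation. This is only the reference algorithm, not a claim about how the DS function is implemented internally.

import zlib

def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time CRC-32 using the reflected form (0xEDB88320) of the
    generator polynomial quoted above."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

# Agrees with zlib on the standard test vector "123456789" (0xCBF43926).
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926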

If you truly want to identify small or large changes in your data then there are really only two methods to ensure that these changes are detected -

1) CDC on the source system or application CDC so that you only get the changed, added or deleted records from the extract

2) CRC because it is fast, accurate and leaves little room for doubt when used properly.

Scanning a row via UV BASIC and comparing fields or columns is slow, prone to error and requires coding changes when new fields are added or removed.

Hope this helps,

Michael Hester