Duplicate Key values in CDC Stage

Posted: Thu Jun 07, 2018 4:18 pm
by phanikumar
Hi All,

Can the CDC stage handle duplicates? We have a scenario where duplicate values are coming in for the KEY column. The job produces different results when run multiple times.

The job does a join on the KEY column and produces multiple records for the same key. But the change codes are not consistent from one run to the next.

Both input links are hash partitioned and sorted on the KEY column.

Can anyone please help me understand why it would produce different results?

Regards
Kumar

Posted: Fri Jun 08, 2018 1:09 am
by rbk
Not sure but this is a scenario that I have noticed as well.

In cases where we have duplicates in the source (after), the first record gets identified as a copy (assuming the data is available in the reference/before as well) and the second record gets identified as an insert. Not sure why it does that. It would be nice to get an understanding of how exactly the CDC stage works. Also, I think it is better not to have duplicates in the source and reference, considering that we are trying to identify the changes. Do let us know if you come across any solutions...

Posted: Fri Jun 08, 2018 4:23 pm
by Mike
I'm not sure if this is documented or if it is just something I know from experience.

The change capture stage requires unique keys on its inputs.

This makes perfect sense if you think about the classic two file match logic that probably happens under the covers where a key match results in the next record from each file being read before the next key comparison.
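To make that concrete, here is a rough Python sketch of a classic sorted two-file match. It is only an illustration of the general technique, not the stage's actual internals, and the function name `change_capture` and the `(key, value)` row layout are made up for the example.

```python
def change_capture(before, after):
    """Classic two-file match over lists of (key, value) rows sorted by key.

    Hypothetical sketch: on a key match, BOTH inputs advance to their
    next record before the next key comparison.
    """
    out = []
    i = j = 0
    while i < len(before) and j < len(after):
        bk, bv = before[i]
        ak, av = after[j]
        if bk == ak:
            # Keys match: compare values, then step past both records.
            out.append((ak, "copy" if bv == av else "edit"))
            i += 1
            j += 1
        elif bk < ak:
            # Key exists only in the before data.
            out.append((bk, "delete"))
            i += 1
        else:
            # Key exists only in the after data.
            out.append((ak, "insert"))
            j += 1
    # Whatever is left on either side has no partner.
    out.extend((k, "delete") for k, _ in before[i:])
    out.extend((k, "insert") for k, _ in after[j:])
    return out


before = [(1, "a"), (2, "b")]
after = [(1, "a"), (1, "a"), (2, "b")]
print(change_capture(before, after))
# [(1, 'copy'), (1, 'insert'), (2, 'copy')]
```

With a duplicate key in the after data, the first duplicate pairs up with the before record and the second one has nothing left to match, so it comes out as an insert. And if the two duplicates carry different values, the result depends on which one arrives first: `[(1, "x"), (1, "a")]` against a before of `[(1, "a")]` gives edit then insert, while `[(1, "a"), (1, "x")]` gives copy then insert. A sort on the key alone does not fix the order within a key, which would explain change codes that differ between runs.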

Having said that... it is still possible to handle multiple version changes for a given key in a single job execution utilizing the change capture stage.

It just takes a little creativity to turn the duplicate keys into the unique keys that the stage requires.
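One possible way to do that (an assumption for illustration, not necessarily the only approach) is to number the occurrences within each key and treat key plus occurrence number as the compound key on both inputs. A small Python sketch of the idea, with the hypothetical helper `add_occurrence_number`:

```python
from itertools import groupby
from operator import itemgetter


def add_occurrence_number(rows):
    """Turn (key, value) rows into ((key, seq), value) rows.

    Assumes rows are sorted by key AND in a deterministic order within
    each key (for example, by a secondary sort column), otherwise the
    sequence numbers would vary between runs.
    """
    out = []
    for key, group in groupby(rows, key=itemgetter(0)):
        # seq restarts at 1 for every new key value.
        for seq, (_, value) in enumerate(group, start=1):
            out.append(((key, seq), value))
    return out


rows = [(1, "a"), (1, "b"), (2, "c")]
print(add_occurrence_number(rows))
# [((1, 1), 'a'), ((1, 2), 'b'), ((2, 1), 'c')]
```

In a job, the equivalent would be a secondary sort plus a generated counter column ahead of the Change Capture stage, with the counter added to the key columns.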

Mike

CDC or Change Capture

Posted: Wed Jun 13, 2018 2:13 pm
by rameshrr3
IDK why it's common to refer to the Change Capture stage as the CDC stage, because it creates quite a bit of confusion with the CDC Transaction stage.

Posted: Thu Jun 14, 2018 5:07 am
by qt_ky
I agree. It is a common misnomer. Clearly, the Change Capture stage would be abbreviated CC. CDC is different.

Posted: Thu Jun 14, 2018 7:18 am
by chulett
So basically it's CDD or Change Data Detection? That's what I've known it as and as noted it's a distinctly different process than Change Data Capture.

Posted: Fri Jun 15, 2018 5:33 am
by qt_ky
Yes, the Change Capture stage performs change data detection, but watch out... because "CDD" is another IBM product acronym for Change Data Delivery! :shock:

Posted: Fri Jun 15, 2018 6:21 am
by chulett
Great. Now we need ACD - Acronym Collision Detection.