Different results in 8.7 job than 8.1 version job

nikhil_bhasin · Post by **nikhil_bhasin** » Thu Dec 12, 2013 1:47 pm

Hi All,

I am facing an unusual issue while migrating jobs from version 8.1 to version 8.7. A couple of jobs that use a Remove Duplicates stage with hash partitioning produce different results when I compare the 8.1 output with the 8.7 output.
The scenario is as follows.
Input:
colA,colB,colC,colD
A,B,C,1
A,B,D,2
B,C,D,1

Keys used for removing duplicates, hash partitioning, and sorting (all set in the Partitioning tab of the Remove Duplicates stage), with Duplicate To Retain = First:
colA, colB

The results look like this:
DS 8.1 job output:
A,B,C,1
B,C,D,1

DS 8.7 job output:
A,B,D,2
B,C,D,1

Every time I run both jobs, the record that is retained from each set of duplicates appears to be chosen at random. Non-duplicate records are not affected.

Can anyone suggest a way out of this situation? It would be a great help.

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Thu Dec 12, 2013 4:12 pm

This is very strange behavior. I can't say I've seen that problem on either of the two working 8.7 environments. It sounds like your job is configured correctly. I assume you've ensured that the new job has the sorts specified in the correct order (descending).

Have you switched to NLS on the new system? Can you subset some of the records in question and output them to a sequential file so you can look at them in a hex editor? I'm wondering if there are invisible characters in the field that are causing it to sort "higher".

nikhil_bhasin · Post by **nikhil_bhasin** » Thu Dec 12, 2013 11:20 pm

The NLS settings are the same for both: ASCII_ASCL. Also, the selection of records to retain keeps changing every time I run the jobs with the same source data. Is there any change in the hash algorithm between the two versions?

RPhani · Post by **RPhani** » Fri Dec 13, 2013 1:25 am

Hi,

What are the data types and lengths of the duplicate columns? Char or VarChar?

I don't think there is any difference in the algorithm.
----------------------
Phani

nikhil_bhasin · Post by **nikhil_bhasin** » Sun Dec 15, 2013 10:55 am

If you mean the data types of the key columns, they are integer and date.

ArndW · Post by **ArndW** » Wed Dec 18, 2013 7:00 am

How are you sorting? Best use a sort stage and explicitly specify "Stable Sort = true" to remove the non-deterministic part of your problem.

Since the data

A,B,C,1
A,B,D,2

is only sorted on "A" and "B" the record order when not using a stable sort might be different.
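
To illustrate the point, here is a minimal Python sketch (not DataStage code, and not how the engine is implemented). It assumes the only thing that varies between runs is the order in which rows reach the sort, and models that by trying every arrival order. The helper name `dedupe_first` and the choice of colD as a tie-breaker are hypothetical, for illustration only.

```python
from itertools import permutations

# Sample rows from the original post: (colA, colB, colC, colD)
rows = [("A", "B", "C", 1), ("A", "B", "D", 2), ("B", "C", "D", 1)]


def dedupe_first(sorted_rows, key_len=2):
    """Keep the first row seen for each (colA, colB) key."""
    seen = set()
    kept = []
    for row in sorted_rows:
        key = row[:key_len]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Sort on the duplicate keys only. Rows with equal keys stay in
# arrival order, so the retained row depends on that order.
results_key_only = {
    tuple(dedupe_first(sorted(p, key=lambda r: r[:2])))
    for p in permutations(rows)
}

# Add colD as an extra sort key (an assumption, any column that
# distinguishes the duplicates would do). The order within each
# key group is now fully defined.
results_tiebreak = {
    tuple(dedupe_first(sorted(p, key=lambda r: (r[0], r[1], r[3]))))
    for p in permutations(rows)
}

print(len(results_key_only))   # number of distinct outcomes, keys only
print(len(results_tiebreak))   # number of distinct outcomes, with tie-breaker
```

With the key-only sort, both outputs reported in the first post are possible. Once the sort keys fully order the rows inside each duplicate group, only one outcome remains.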

nikhil_bhasin · Post by **nikhil_bhasin** » Wed Dec 18, 2013 7:24 am

@ArndW
I am using the sort option in the Partitioning tab of the Remove Duplicates stage itself. I am not very clear about the pros and cons of a stable sort, but I will try it and post back the results.

ArndW · Post by **ArndW** » Thu Dec 19, 2013 7:59 am

If you sort the following rows that have 4 columns

A,B,C,1
B,C,D,1
A,B,D,2

on the first 2 columns using a non-stable (but faster) sort you might get a result of:

A,B,C,1
A,B,D,2
B,C,D,1

or you might get a result of:

A,B,D,2
A,B,C,1
B,C,D,1

This is due to the way the sort algorithm works internally, as it creates groups and subtrees and it might change the order of the rows for items with duplicate sort keys. Using "stable sort" guarantees that the order of rows for duplicates is identical to the source order, but a stable sort can be a lot slower and less efficient.
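
As a rough illustration of what "stable" means, the Python sketch below (again, not DataStage code) uses Python's built-in `sorted`, which is documented to be stable. It shows that a stable sort on the two key columns gives the same result as sorting on the keys plus each row's source position. The names `sort_key` and `retain_first` are hypothetical helpers.

```python
# The three rows in the source order used in the post above
rows = [("A", "B", "C", 1), ("B", "C", "D", 1), ("A", "B", "D", 2)]


def sort_key(row):
    """Sort and duplicate key: the first two columns."""
    return row[:2]


# Stable sort: rows with equal keys keep their source order.
stable = sorted(rows, key=sort_key)

# Equivalent view: sort on (key, source position). The position
# acts as an explicit tie-breaker for rows with equal keys.
decorated = sorted(enumerate(rows), key=lambda pair: (sort_key(pair[1]), pair[0]))
by_position = [row for _, row in decorated]


def retain_first(sorted_rows):
    """Keep the first row of each key group (Duplicate To Retain = First)."""
    kept = {}
    for row in sorted_rows:
        kept.setdefault(sort_key(row), row)
    return list(kept.values())


print(stable)
print(by_position)
print(retain_first(stable))
```

A non-stable sort makes no promise about the order inside the A,B group, so either A,B,C,1 or A,B,D,2 can end up first and be the one that is retained. Note that a stable sort only reproduces the source order it is given; if the order in which rows arrive changes between runs, the retained row can still change.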

Hanumantharao Allada · Sat Mar 14, 2015 7:30 pm

Hi nikhil_bhasin,

Has this issue been resolved?

If not, can you please confirm:
1) The number of nodes you are using in 8.1 and in 8.7 for this job?
2) Are you using any range lookup in the job?

weiyi_will · Post by **weiyi_will** » Sun Jun 28, 2015 11:57 pm

ArndW wrote: How are you sorting? Best use a sort stage and explicitly specify "Stable Sort = true" to remove the non-deterministic part of your problem.

Since the data

A,B,C,1
A,B,D,2

is only sorted on "A" and "B" the record order when not using a stable sort might be different.

Agree

DSXchange
