Comparing two dynamic data files with header using datastage

Srimeenadevi · Post by **Srimeenadevi** » Tue Jun 18, 2013 3:52 am

Hi,

The scenario is like this

file1.csv

Customer_id,customer_name,parts_name,vehicle_name,parts_no

file2.csv

Customer_id,customer_name,vehicle_name,parts_no

Here in both the files, parts_no is the key field. Two files come from external systems. Both the files are dynamic and columns can be swapped for each run. Sometimes one field - parts_name can be missing in the file.

How can we handle this and proceed with our jobs.

1) Bringing both the files into a proper order
2) In case one field is not coming in either of the file reject the file saying improper format.

Can this be done exclsuively in datastage itself.

Please suggest.

miwinter · Post by **miwinter** » Tue Jun 18, 2013 4:02 am

I'm not sure you can exclusively handle this through Datastage alone, instead I'd look to check the file format through scripting and produce a relevant schema file to use on input to the initial Datastage job reading the source in. Look at schema files and partial schemas.

Srimeenadevi · Post by **Srimeenadevi** » Tue Jun 18, 2013 4:19 am

Yes, I too thought of a similar fashion. Thought would have a reference_file.csv and compare the two data files - file1.csv and file2.csv.

Change the files into the format of reference_file.csv and proceed with our jobs.

But in case of one column not coming in the input file, i.e., parts_name is not coming in the file, I have to return the file back to the users, which I am not sure how to handle it.

Code: Select all

#!/bin/bash

DATAFILE=${1:-data.txt}
COLUMNFILE=${2:-list.txt}

awk -F, -v colsFile="$COLUMNFILE" '
   BEGIN {
     j=1
     while ((getline < colsFile) > 0) {
        col[j++] = $1
     }
     n=j-1;
     close(colsFile)
     for (i=1; i<=n; i++) s[col[i]]=i
   }
   NR==1 {
     for (f=1; f<=NF; f++)
       if ($f in s) c[s[$f]]=f
     next
   }
   { sep=""
     for (f=1; f<=n; f++) {
       printf("%c%s",sep,$c[f])
       sep=FS
     }
     print ""
   }
' "$DATAFILE"

file1.csv
Customer_id,customer_name,parts_name,vehicle_name,parts_no
1,abc,wheel,cycle,12
2,def,screw,car,15

file2.csv
Customer_id,customer_name,vehicle_name,parts_no
1,abc,cycle,12
2,frt,car,20

reference_file.csv
Customer_id
customer_name
parts_no
parts_name
vehicle_name

Here in file2.csv, the parts_name column is missing. Need some help to see how this can be handled in script. The file should be rejected.

Thanks.

miwinter · Post by **miwinter** » Tue Jun 18, 2013 4:32 am

For the handling part, unless I am misunderstanding something, it seems relatively simple. Does the required/expected number of columns not remain the same? If so, a count of the columns in the header can be checked. A failure to match can result in the process exiting as a failure, which you can handle as necessary to communicate this back to the user.

Srimeenadevi · Post by **Srimeenadevi** » Tue Jun 18, 2013 4:37 am

That looks like a better idea. That can be handled separately in the initial pace and send a notification to users.