Comparing two dynamic data files with header using datastage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Srimeenadevi
Participant
Posts: 15
Joined: Thu May 31, 2007 3:36 am

Comparing two dynamic data files with header using datastage

Post by Srimeenadevi »

Hi,

The scenario is like this

file1.csv

Customer_id,customer_name,parts_name,vehicle_name,parts_no

file2.csv

Customer_id,customer_name,vehicle_name,parts_no

Here in both the files, parts_no is the key field. Two files come from external systems. Both the files are dynamic and columns can be swapped for each run. Sometimes one field - parts_name can be missing in the file.

How can we handle this and proceed with our jobs.

1) Bringing both the files into a proper order
2) In case one field is not coming in either of the file reject the file saying improper format.

Can this be done exclsuively in datastage itself.

Please suggest.
miwinter
Participant
Posts: 396
Joined: Thu Jun 22, 2006 7:00 am
Location: England, UK

Post by miwinter »

I'm not sure you can exclusively handle this through Datastage alone, instead I'd look to check the file format through scripting and produce a relevant schema file to use on input to the initial Datastage job reading the source in. Look at schema files and partial schemas.
Mark Winter
<i>Nothing appeases a troubled mind more than <b>good</b> music</i>
Srimeenadevi
Participant
Posts: 15
Joined: Thu May 31, 2007 3:36 am

Post by Srimeenadevi »

Yes, I too thought of a similar fashion. Thought would have a reference_file.csv and compare the two data files - file1.csv and file2.csv.

Change the files into the format of reference_file.csv and proceed with our jobs.

But in case of one column not coming in the input file, i.e., parts_name is not coming in the file, I have to return the file back to the users, which I am not sure how to handle it.

Code: Select all

#!/bin/bash

DATAFILE=${1:-data.txt}
COLUMNFILE=${2:-list.txt}

awk -F, -v colsFile="$COLUMNFILE" '
   BEGIN {
     j=1
     while ((getline < colsFile) > 0) {
        col[j++] = $1
     }
     n=j-1;
     close(colsFile)
     for (i=1; i<=n; i++) s[col[i]]=i
   }
   NR==1 {
     for (f=1; f<=NF; f++)
       if ($f in s) c[s[$f]]=f
     next
   }
   { sep=""
     for (f=1; f<=n; f++) {
       printf("%c%s",sep,$c[f])
       sep=FS
     }
     print ""
   }
' "$DATAFILE"
file1.csv
Customer_id,customer_name,parts_name,vehicle_name,parts_no
1,abc,wheel,cycle,12
2,def,screw,car,15

file2.csv
Customer_id,customer_name,vehicle_name,parts_no
1,abc,cycle,12
2,frt,car,20

reference_file.csv
Customer_id
customer_name
parts_no
parts_name
vehicle_name

Here in file2.csv, the parts_name column is missing. Need some help to see how this can be handled in script. The file should be rejected.


Thanks.
miwinter
Participant
Posts: 396
Joined: Thu Jun 22, 2006 7:00 am
Location: England, UK

Post by miwinter »

For the handling part, unless I am misunderstanding something, it seems relatively simple. Does the required/expected number of columns not remain the same? If so, a count of the columns in the header can be checked. A failure to match can result in the process exiting as a failure, which you can handle as necessary to communicate this back to the user.
Mark Winter
<i>Nothing appeases a troubled mind more than <b>good</b> music</i>
Srimeenadevi
Participant
Posts: 15
Joined: Thu May 31, 2007 3:36 am

Post by Srimeenadevi »

That looks like a better idea. That can be handled separately in the initial pace and send a notification to users.
Post Reply