Generate schema files in DataStage

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


Post by skp »

Hi,

We have a requirement where we will receive 25 different files (.csv), each with different metadata. We need to create a single job which can handle all 25 files.

Any idea on how we can do this please?

Post by Thomas.B »

First, we need to know what you want your job to do with these 25 files. Then, what kind of metadata do you have in these files (string, decimal, ...)?
Do you always have the same number of fields in each file?

As you said in the title, a solution is to use a schema file for each CSV file, with the CSV file name and the schema file name as job parameters.
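
For illustration, a minimal sketch of what such a schema file could look like, assuming a layout of an Id column followed by monthly value columns (all names, types and format properties here are hypothetical and would need adjusting to the real files):

Code:

// hypothetical schema for one CSV layout: Id plus monthly columns
record {final_delim=end, delim=',', quote=none}
(
    Id: int32;
    Feb16: nullable dfloat;
    Mar16: nullable dfloat;
    // ... one entry per remaining month ...
    Dec19: nullable dfloat;
)

The job would then take the CSV file name and the schema file name as parameters (e.g. #SourceFile# and #SchemaFile#), point the Sequential File stage's 'Schema File' property at the latter, and run with RCP enabled on the output link.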
BI Consultant
DSXConsult

Post by chulett »

... along with RCP (Runtime Column Propagation), but as noted it really depends on what you need to do after you read the file. What's your target, by the way?
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by skp »

Our source file can come with a dynamic structure. For the Jan file we will get the Feb to Dec fields plus the next 3 years' data (48 fields in total, including Id); for the Feb file we will get the Mar to Dec fields plus the next 3 years' data (47 fields in total, including Id); and so on, until for the Dec file we will get just the next 4 years' data (49 fields in total, including Id).

We need to split the data into 4 records for each incoming record.

For the Jan file the data will look like this:
Id,Feb16,Mar16,Apr16,May16,Jun16,Jul16,Aug16,Sep16,Oct16,Nov16,Dec16,Jan17,Feb17,Mar17,Apr17,May17,Jun17,Jul17,Aug17,Sep17,Oct17,Nov17,Dec17,Jan18,Feb18,Mar18,Apr18,May18,Jun18,Jul18,Aug18,Sep18,Oct18,Nov18,Dec18,Jan19,Feb19,Mar19,Apr19,May19,Jun19,Jul19,Aug19,Sep19,Oct19,Nov19,Dec19
1000,20,30,40,50,60,70,80,90,100,111,222,20,22,33,44,55,66,77,88,89,99,00,11,11,54,45,76,87,76,56,76,90,97,65,66,43,55,76,99,44,21,43,56,76,87,98,00

Output:
Id,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Year
1000,,20,30,40,50,60,70,80,90,100,111,222,2016
1000,20,22,33,44,55,66,77,88,89,99,00,11,2017
1000,11,54,45,76,87,76,56,76,90,97,65,66,2018
1000,43,55,76,99,44,21,43,56,76,87,98,00,2019

Months earlier in the current year do not get data from the file, but while loading we need to populate blanks for those previous months. Our target should be to segregate the data into 4 yearly records for each incoming record.

We have tried creating 12 schema files for the 12 different file formats and loading the data into a dataset, but while doing the segregation we are not able to handle the current year: for the Jan month file we do not get the Jan field, so we are not able to map that field value dynamically.

Any ideas on how we can achieve this please?

Post by rkashyap »

You can add extra 'leading' Null columns (for the missing months) to create a file with consistent metadata, i.e. a constant number of columns.

This can be done by reading the file as a single column and then splitting it in a Transformer, parsing from right to left with the Field function (last column = 48th month, last-1 column = 47th month, and so on; any missing columns are set to Null).
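
For illustration, a minimal sketch of that Transformer logic, assuming the record was read into a single VarChar column (lnIn.rec is a hypothetical link.column name) and a fixed horizon of 48 months:

Code:

svFieldCount : Count(lnIn.rec, ',') + 1
svOffset     : svFieldCount - 48
Jan16_value  : If (1 + svOffset) >= 2 Then Field(lnIn.rec, ',', 1 + svOffset) Else ''
Feb16_value  : If (2 + svOffset) >= 2 Then Field(lnIn.rec, ',', 2 + svOffset) Else ''

In general, month m (1 = Jan16 ... 48 = Dec19) sits at delimited field m + svOffset; anything that resolves to field 1 (the Id) or lower is a missing leading month and stays blank/Null.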

Alternatively, by passing the following command in the Filter option of the source Sequential File stage (the first awk pads the missing leading months with empty fields; the second splits each padded record into four 12-column lines), you will also get four output lines, each with the Id and twelve data columns.

Code:

awk -F"," '{printf $1 ","; for (i=0; i<=(48-NF); i++){printf ","}; for (j=2; j<=NF; j++){printf $j "," }; print NL}'| awk -F"," '{printf $1; for (i=2; i<50; i++) {if ((i == 14) || (i == 26) || (i == 38)) {printf  "\n" $1 "," $i} else {printf "," $i}} print NF}''

Post by skp »

Hi,

Thanks for your quick response. But when we read the complete record as one column in the Sequential File stage, how do we identify the missing columns in the Transformer? For the Jan data the Jan field will be missing from the file, for the Feb data the Jan and Feb fields are missing, and for the Mar data the Jan, Feb and Mar fields are missing. How do we identify this dynamically in the Transformer?

I will try the second option, but can you let me know whether we need to apply it in the source Sequential File stage?

Post by rkashyap »

skp wrote: when we read the complete record as one column in the Sequential File stage, how do we identify the missing columns in the Transformer? For the Jan data the Jan field will be missing, for the Feb data the Jan and Feb fields are missing, and for the Mar data the Jan, Feb and Mar fields are missing. How do we identify this dynamically?
=> Number of missing data fields can be determined by subtracting COUNT of data fields from max possible (49). eg if there are 46 fields in the record, then first three month values would be Null. One of the way to determine no of fields is to read record in a single column, determine count of commas/delimiters and add one to it.
I will try the second option, but can you let me know whether we need to apply it in the source Sequential File stage?
Yes. Try the command in Unix first and (if needed) tweak the awk expression.
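
As a quick sanity check, you can also confirm the per-record field counts (and hence the number of missing months) in Unix before wiring anything into the job; the file name below is just a placeholder:

Code:

awk -F"," '{print NF}' janfile.csv | sort -u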

Post by ShaneMuir »

What is your ultimate target schema? It might be that you do not need to use a schema file at all; you could just split the data into the relevant target columns using a simple collection of loops and stage variables.

Post by skp »

Hi rkashyap,

My source data will not stick to a fixed number of years; the structure may vary, i.e. we may get 4, 5, 6, 7 etc. years of data in our file. So any idea how we can implement this in DataStage? I have tried the first option for 4 years of forecasting data, which works, but if the number of years is dynamic how can we handle this?

Post by skp »

I have done this with Transformer looping and by pivoting the data.

First, in the Transformer I get the number of fields present and perform the iterations based on that; then in a pivot job I perform a pivot with an array size of 12, and now my output is as expected.
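
In case it is useful to others, a minimal sketch of this looping approach (stage-variable and link names are hypothetical, and m stands for whichever of the twelve month columns is being derived):

Code:

svFieldCount   : Count(lnIn.rec, ',') + 1
svYearCount    : Div(svFieldCount - 2, 12) + 1
Loop Condition : @ITERATION <= svYearCount
svIdx_m        : (@ITERATION - 1) * 12 + m + svFieldCount - (12 * svYearCount)
Month_m        : If svIdx_m >= 2 Then Field(lnIn.rec, ',', svIdx_m) Else ''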

But only one thing is left, the Year part: I am not able to generate the year dynamically. For example, if I have 5 years of data my records should look like this:

Account_ID,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Year
1000,00,20,30,40,50,60,70,80,90,100,111,222,2016
1000,20,22,33,44,55,66,77,88,89,99,00,11,2017
1000,11,54,45,76,87,76,56,76,90,97,65,66,2018
1000,43,55,76,99,44,21,43,56,76,87,98,00,2019
1000,44,54,77,98,43,22,44,55,77,88,99,10,2020

How can we do this?

Post by chulett »

Well... where does the starting year come from - the system date? Something in the file or filename? Seems to me if you have the starting year then you just increment it by one for each subsequent row. Of course, I could be greatly simplifying the process. :wink:

Perhaps you could post some specifics around this last issue?
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by rkashyap »

Examples are good, but a precise requirement is more helpful ;)

As Craig mentioned, based on your examples the starting year seems to be the current year, so get it from the system date and increment it by one for every new output record.

However, if the requirement is to get the year from the header row, then read the file in sequential mode and, in a Transformer stage variable, extract the year from the column name in the header row (16 from Feb16 in your example). Subsequently you can add the century to it and populate it in the output, incrementing by one for each new row.
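
For the system-date variant, a minimal sketch of the derivation, building on the looping approach above (names hypothetical):

Code:

svStartYear : YearFromDate(CurrentDate())
Year        : svStartYear + (@ITERATION - 1)

For the header-row variant you would instead seed svStartYear from a header column name, e.g. 2000 + Right(svHeaderField, 2).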

Going back to the earlier discussion of the Sequential File stage's Filter option: awk can also handle data for a varying number of years. The first awk below pads each record to a whole number of years with empty leading fields; the second emits one line per year as Id, year, then the twelve monthly values, seeding the year from the system date.

Code:

awk -F"," '{printf $1 ","; for (i=0; i<=(((int((NF-2)/12)+1)*12)-NF); i++){printf ","}; for (j=2; j<=NF; j++){printf $j "," }; print NL}'|nawk -v yr="$(date +'%Y')" -F"," '{for (i=2; i<(NF); i++) {if (((i-2) % 12)==0) {printf  "\n" $1 "," yr+int((i-2)/12) "," $i} else {printf "," $i}}}'