Reading multiple files with same metadata from a list

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sharmabhavesh
Premium Member
Posts: 38
Joined: Tue Jun 19, 2012 11:03 pm
Location: India

Reading multiple files with same metadata from a list

Post by sharmabhavesh »

Hi,
I have 200-300 different files with the same metadata (which I will pass through a schema file) that I want to load into a single table. Below are the limitations I have:
1. The files are in different folders
2. Each file has a header

How can I read all these files at the same time in a single job? Is there a filelist concept in DataStage through which I list all the files along with their paths in a list file, and DataStage automatically reads all the files in the list?

If the above is not possible, I will place all the files in a single folder and use a pattern match to read them. But in that case, how do I handle the header or trailer rows?
Thomas.B
Participant
Posts: 63
Joined: Thu Apr 09, 2015 6:40 am
Location: France - Nantes

Post by Thomas.B »

You can do it this way:
  • Create a job to load one sequential file to a table, with the input stage's file property set to a job parameter.
  • Create a text file that lists every file you have to load.
  • Create a sequence job like this:

Code: Select all

Execute Command --► Start loop activity ------► Job Activity
                             ▲                       |
                             |                       |
                             |                       |
                             |                       |
                             |                       ▼
                             ----------- End loop activity
Where the Execute Command stage counts the number of lines in your text file (e.g. wc -l ~/MyTextFile.txt), the Job Activity is the job you previously created, and the Start Loop activity creates a numeric loop from 1 to the Execute Command stage's output.
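One detail worth noting for the Execute Command stage (the paths below are purely illustrative): `wc -l file` echoes the file name after the count, whereas reading from stdin yields the bare number the Start Loop activity needs as its upper bound.

```shell
# Illustrative list file standing in for MyTextFile.txt.
printf '/data/a.txt\n/data/b.txt\n/data/c.txt\n' > /tmp/MyTextFile.txt

# Reading via stdin makes wc print only the count, with no file
# name appended -- suitable as the loop's "To" value.
wc -l < /tmp/MyTextFile.txt
```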
BI Consultant
DSXConsult
sharmabhavesh
Premium Member
Posts: 38
Joined: Tue Jun 19, 2012 11:03 pm
Location: India

Post by sharmabhavesh »

Hi Thomas,
The solution you suggested is going to run the job 300 times (once for each file). I want to run the job just once. Like the file pattern option in the Sequential File stage, which reads multiple files matching the same pattern, is there a way I can read multiple files (which may or may not be in the same directory) with the same stage in a single job?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There's no "filelist" concept in DataStage like the indirect reads that Informatica supports, which I think is too damn bad. You may be able to leverage the Filter option in the Sequential stage, you'd just need something at the command line to send the files to stdout with the first record removed from each and the stage would read that input stream rather than the files directly.
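A hedged sketch of such a command-line filter, with invented file names: awk restarts its per-file line counter FNR for each input, so `FNR > 1` drops exactly the first (header) record of every file while streaming everything else to stdout for the stage to read.

```shell
# Two sample data files, each with its own header row (paths invented).
printf 'ID,NAME\n1,alpha\n2,beta\n' > /tmp/part1.csv
printf 'ID,NAME\n3,gamma\n'         > /tmp/part2.csv

# FNR restarts at 1 for every input file, so this prints all files
# to stdout with each file's first line removed.
awk 'FNR > 1' /tmp/part1.csv /tmp/part2.csv
```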
-craig

"You can never have too many knives" -- Logan Nine Fingers
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

The default read method in the sequential file stage is "Specific File(s)". You can specify more than 1 file if you like. You'll have to experiment to see how you need to delimit the file list and whether there is a max size on that property.

Craig never forgets anything, so I had some doubt as to whether I remembered that feature correctly. But I opened up a sequential file stage and there it was.

I think I may have used that capability once a long time ago.

You'll still have to deal with the multiple headers and trailers just like you would if using the file pattern read method.

Mike
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You can still use the Filter in this case as long as whatever is supplying the data to stdout - say Perl or awk or sed or your utility of choice - strips the first and/or last record as needed from each file.

It's a bit Old School but it does work. Or can be made to. 8)
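As a sketch of that Old School approach (sample data invented here): sed can delete the first and last line of each file, and xargs -n 1 keeps the edit per-file rather than per-stream, so every file loses its own header and trailer.

```shell
# A sample file with both a header and a trailer row (illustrative).
printf 'HEADER\nrow1\nrow2\nTRAILER\n' > /tmp/part_a.txt

# sed address 1 is the first line and $ is the last; 'd' deletes both.
# xargs -n 1 runs sed once per file, so each file is trimmed
# individually rather than only the whole concatenated stream.
printf '/tmp/part_a.txt\n' | xargs -n 1 sed '1d;$d'
```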
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There is something like a file list in DataStage Sequential File stage; you can specifically name each file (multiple File properties), you can use a wildcard (File Pattern), you can use a Filter command to pre-read the files. The Filter command could even be constructed dynamically in a UNIX script executed from the controlling job sequence.
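One possible sketch of constructing that Filter command dynamically from a controlling script (directory and file names are invented for illustration):

```shell
# Illustrative landing directory with two freshly arrived files.
mkdir -p /tmp/landing
printf 'H\nx\n' > /tmp/landing/f1.csv
printf 'H\ny\n' > /tmp/landing/f2.csv

# Assemble a Filter command over whatever files exist right now;
# the controlling sequence could pass this string to the job as
# a parameter feeding the stage's Filter property.
filter_cmd="awk 'FNR > 1' $(ls /tmp/landing/*.csv | tr '\n' ' ')"
echo "$filter_cmd"
```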
Last edited by ray.wurlod on Sat Oct 03, 2015 2:07 am, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

According to the docs anyway, the "filelist" concept is there. Check this out. It may not be the best grammar (2nd sentence), so I'm not certain what was intended...
File pattern

Specifies a group of files to import. Specify file containing a list of files or a job parameter representing the file. The file could also contain be any valid shell expression, in Bourne shell syntax, that generates a list of file names.
Source: Sequential File stage: Source category
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Nice... if it does support a file containing "a list of files" to read then it is exactly like Informatica's "indirect" read option. 8) Wonder when that was added?

However, does it handle a header record on every file in that case?
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Yes, the description sounds like the filelist concept, but I think in reality it's misleading documentation. It doesn't actually work that way, as far as I can tell.

I think what they really meant to say was to enter a file pattern that will result in a list of files, because that's how it's implemented. Basically, all the ways you guys described above can result in a list of files.

But this indirect read option sounds interesting... Have you all discussed that for server jobs using a folder stage? Is that equivalent to the feature, or not? I haven't used that one.
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It is somewhat equivalent. The Folder stage (from what I recall) was originally (only?) meant to be used with the XML stages. It only supports two output columns: one contains the filename being processed and the other supplies the entire contents of the file all in one burst. Hence the fit for XML processing. Flat file... not so much. Unless they've enhanced it in the last few years.

Now, IIRC I have used it many moons ago without the second column to just return a list of filenames as a source into a job, but that was for a project where I literally just needed the names of the files being processed.

For Informatica, we have processes that from a base location gather up relative filenames to process, sometimes from a single directory and sometimes from several. They are written to a ".IN" file that becomes the "source file" to be read when set to indirect read mode. It processes them almost as if you'd built a looping job but without the loop - all files are read in order, the current filename is passed into the mapping with each record and the header settings are applied to each file individually as it is encountered. Rather... helpful.

Oh, and when I need a list of the filenames, I just read the indirect file in direct mode. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

OK, that's cool. I've seen you mention this over and over, wishing DataStage had this feature...

As far as I can tell from the Support Portal, the feature has been there since at least version 8.0.1, if not earlier. It also appears from the Knowledge Center by searching on 'file pattern' that it wasn't documented until version 8.5.

In a parallel job, using a file pattern with a command (i.e. "any valid shell expression, in Bourne shell syntax, that generates a list of file names"), you can achieve the filelist concept, or indirect read.

Your command could go and dynamically build a list of files, or it can read a file that contains a list of file names. I just tested it out to be certain and it worked fine, so long as I fully qualify the paths on my files.

Of course, I had to try a number of variations on the syntax before getting it just right. The docs are a bit lacking here for examples. Basically, you surround the command with backticks. Example:

If file_list1.txt contains a list of file names (with paths) like:
/path/a.txt
/path/m123.txt
/path/XYZ.txt

Then you can use a simple file pattern to read all the files in the list from one Sequential File stage:

Code: Select all

`cat /path/file_list1.txt`
Works like a charm. I don't know about the header and trailer processing; I haven't thought about it. I just tested the file pattern, and I think it's equivalent to what you described in "the other tool" even though it's not advertised as such. This feature is very flexible! Maybe it's more powerful than IBM realized.
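For what it's worth, a quick way to preview what the back-tick expression expands to before the stage ever sees it (all paths invented):

```shell
# Recreate the example: a list file naming three data files.
printf '/tmp/a.txt\n/tmp/m123.txt\n/tmp/XYZ.txt\n' > /tmp/file_list1.txt

# echo shows the expansion the File Pattern property receives:
# the three names joined into one whitespace-separated pattern.
echo `cat /tmp/file_list1.txt`
```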

8)
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

8)

Wait... over and over? (sniff) Note to self...

And yet another edit - now that you mention it, I do believe we had this same conversation last time this came up, at least up to the "hey, it actually does exist!" part. Thanks for taking the time to actually go through the workings of it.
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

OK, not over and over. :) I recall this question on just a handful of topics. You're welcome. Glad it actually works!
Choose a job you love, and you will never have to work a day in your life. - Confucius
sharmabhavesh
Premium Member
Posts: 38
Joined: Tue Jun 19, 2012 11:03 pm
Location: India

Post by sharmabhavesh »

Hi,
Thanks all, the UNIX command really worked for opening all the files through a list/indirect file. There's one problem which I am still facing. I am trying to remove the header by using the command `cat listfile.txt|grep -v 'HEADER'`. This is giving me a warning:
Sequential_File_6,0: Field "fact_id" has import error and no default value; data: {F A C T _ I D}, at offset: 0
Sequential_File_6,0: Import warning at record 0.

With this warning the header record is dropped by DS (since the metadata is different).
What could be the reason? Is it because I am using a Windows file and grep is not able to identify the line terminator, or something like that?
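If the Windows line-terminator suspicion is right, one way to test it (file contents invented here) is to strip carriage returns in the same pipeline; note that in this position grep -v filters lines of the list file, i.e. file names, not rows inside the data files themselves.

```shell
# Illustrative list file saved with Windows CRLF line endings.
printf '/tmp/f1.txt\r\n/tmp/f2.txt\r\n' > /tmp/listfile.txt

# tr -d '\r' removes the carriage returns, so each expanded file
# name no longer carries an invisible trailing \r.
cat /tmp/listfile.txt | tr -d '\r' | grep -v 'HEADER'
```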
Post Reply