import .csv fails when delimiter or LF found within value

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

emaher
Participant
Posts: 3
Joined: Mon Apr 14, 2014 10:27 am
Location: Seattle

import .csv fails when delimiter or LF found within value

Post by emaher »

I have a .csv file exported from our Cisco phone system.
It uses a semicolon as the column delimiter and appears to have a semicolon after the last column, followed by an LF character.

The file appears to escape the following with a backslash: null/empty columns, and any semicolons or line-feeds that appear within the data.

My parallel job is structured as:
[sequential file stage] -> [transformer stage] -> [netezza connector stage]

My import process spits out warnings (resulting in skipped rows) whenever it sees either a semicolon or an LF within a string value. It could be that one or the other is failing it; I'm not really sure.

I have attempted various iterations of convert and ereplace within the transformer stage, and each time I see the following warnings and the problem records are skipped.

- Delimiter for field "field name" not found: input: {some data from the row}, at offset: x
- Field "field name" has import error and no default value: data: {some data from the row}, at offset x

The rest of the data will load, just not the rows that are broken by an embedded delimiter or LF (all broken rows appear to contain both, as represented below).

I have also tried using a server job instead of a parallel job, with the "Contains terminators" option set to 'Yes', but this fails as well (no rows get loaded).

It seems as if I should be able to tell DataStage to ignore characters immediately following the backslash, but I cannot figure out where/how to do this.

Code: Select all

 1 78000274787.0;0;1;1;2;2;\ ;1;89334;\ ;3;-1;+12025550101;2019-08-21 19:40:41.541;...
 2 78000274769.0;1;1;1;1;4;er failures\; nested exception is: \$
 3    5589 ^Icom.cisco.wfapi.WFExecutionException: Too many transfer failures;3;-1;2025550101;
 4 78000274788.0;0;1;1;1;1;\ ;3;-1;2025550101;2;-1;29748;2019-08-21 19:40:45.863;
 5 78000274790.0;0;1;1;3;1;\ ;1;87574;\ ;2;-1;29830;2019-08-21 19:40:57.714;2019-08-21 19:46:12.265;
Above is a sample of a few records:
Line 1: Good data, imports fine
Line 2: There is an escaped semicolon (after the word 'failures') and an escaped LF at the end
Line 3: The continuation of the data pushed onto the next line by the LF in the row above
Line 4: Good data, imports fine
Line 5: Good data, imports fine

This shows what the line endings look like, in case that's important. Note the '$', which is how Vim represents the LF.

Code: Select all

1 ... ;\ ;N;f;f;f;f;f;-1;\ ;000000000068790706EE000A00000000;0;-1;$
2 78000274769.0;1;1;1;1;4;er failures\; nested exception is: \$
3 ... ;\ ;N;f;f;f;f;f;-1;0000000000060FBC0214926200000000;\ ;0;-1;$
4 ... ;\ ;N;f;f;f;f;f;-1;0000000000060F960214927200000000;\ ;0;-1;$
5 ... ;\ ;N;f;f;f;f;f;-1;000000000068790D06EE004100000000;000000000068790D06EE003F00000000;0;-1;$

As a last resort I am thinking of running the file through a sed pre-process, but I keep thinking that DataStage should be able to handle this simple task.

Thank you in advance.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

First off, thanks for mentioning that you tried a Server job. This issue has been talked about here for years and that is always the "go to" solution that gets trotted out. I've taken that route back in the day and generally it works just fine. :?

And technically, it really isn't all that simple. Any process consuming a sequential file on UNIX "knows" that a linefeed is the end of a record, even Vim, as you can see. I also don't believe they can be escaped, but I would have to spend some time researching that. If nothing comes of that research (or your post here), perhaps open a case with your official support provider?

Your best defense against something like this - if at all possible - is getting the source system to stop allowing record or field terminators in the data before it is exported to flat file. That is the "simple task" since the problem (and solution) is obvious at that point. :wink: Of course, that may not be any kind of possible - meaning the much more difficult burden of identifying them falls on you. And the transformer is too late to handle anything like this.

What I believe you'll find, and perhaps others who have dealt with this will chime in, is that your last resort is the way to go: something like awk or sed or perl or whatever your weapon of choice is there. If you can put together something that will read the file and send clean records to standard output, you can use that in the Filter option of the Sequential File stage so you don't need any kind of separate pre-processing step.
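
To make that concrete, here is a rough sketch of what such a filter could look like. It leans on the classic sed continuation-line idiom, assumes GNU sed on the engine tier, and picks a comma as the stand-in for escaped semicolons purely for illustration, so treat it as a starting point rather than a tested recipe.

Code: Select all

# Sketch only (assumes GNU sed). First rejoin any physical line that ends in
# a lone backslash (an escaped LF) with the line that follows it, then turn
# escaped semicolons into commas so the parser no longer sees a delimiter.
sed -e :a -e '/\\$/N; s/\\\n/ /; ta' -e 's/\\;/,/g'

Dropped into the Filter option, every record should then reach the stage on a single line, and anything the filter replaced can be restored later in a Transformer if the original characters matter downstream.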

Good luck!
-craig

"You can never have too many knives" -- Logan Nine Fingers
emaher
Participant
Posts: 3
Joined: Mon Apr 14, 2014 10:27 am
Location: Seattle

Post by emaher »

Thank you for responding.
Unfortunately, and as many others seem to have experienced, I am not able to change how the file is output and simply have to work with what I get.
I ended up going down the path of using a sed command and was able to add it to the Filter property of the Sequential File stage. This was pretty handy, as I was not looking forward to the extra overhead of creating a custom script and dealing with extra file operations.
The escape character within the data made this a fairly straightforward command.
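
As a rough sketch of that kind of filter (illustrative only, not the exact command; it assumes GNU sed, and the file name below is made up just for a quick shell test), the same continuation-line idiom mentioned above applies:

Code: Select all

# Illustrative sketch only: verify the filter against the raw export from the
# shell, then paste the sed command (without the file name) into the
# Sequential File stage's Filter property. 'cisco_export.csv' is a made-up name.
sed -e :a -e '/\\$/N; s/\\\n/ /; ta' -e 's/\\;/,/g' cisco_export.csv | head -20

Because every troublesome character in this export is preceded by the backslash, keying the filter off that escape character is what keeps the command short.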
Again, thank you for your input.
Eric