You say EBCDIC, I say ASCII, let's call the whole thing off

FranklinE · Post by **FranklinE** » Fri Dec 19, 2014 7:55 am

Recent inquiries about how DataStage handles characters, their hexadecimal code values, and maps between character sets prompt me to ask a possibly dumb question. As I always tell my children, there's no such thing as a dumb question, but there are plenty of dumb answers out there.

It goes like this in my basic job design.

First Stage (usually FTP): reads data from z/OS host; Format tab uses default COBOL attributes including EBCDIC.

Last Stage (usually Sequential File): writes data to local DS server running Unix (previously Solaris, now Red Hat Enterprise Linux); Format tab includes ASCII.

Question: on the intervening links, which character set is the data in while DataStage performs the coded instructions? If it changes from EBCDIC to ASCII before the Last Stage, where does that happen?

Seriously, my ignorance of these details makes me feel inadequate when trying to answer questions about EBCDIC. Maybe it's just my natural paranoia...

EDIT: In case it's needed for thoughtful responses, the rest of the basic design always has a Transformer to map the source data to the layout required by the destination application. Sometimes other things like filters and joins are used, but not always. There's the occasional lookup for some jobs, but they are in the minority.

qt_ky · Post by **qt_ky** » Fri Dec 19, 2014 10:47 am

The parallel framework processes only data sets, no matter what the external format is.

The links between stages are virtual data sets.

The internal data format is UTF-16.

Import and export operators are used to perform the conversions.
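As a rough analogy for the import and export steps described above (this is not DataStage code, and it says nothing about how the parallel framework is implemented), here is a minimal Python sketch. It assumes the host data uses the EBCDIC code page `cp037`; the real host code page may differ. The function names `import_record` and `export_record` are hypothetical, and Python's `str` merely stands in for the internal representation.

```python
# Analogy only: decode on the way in, work on an internal form, encode on the way out.
# "cp037" is an assumed EBCDIC code page; substitute the one your host really uses.

def import_record(raw: bytes, codec: str = "cp037") -> str:
    """Convert external bytes to the internal text form (the 'import' step)."""
    return raw.decode(codec)


def export_record(text: str, codec: str = "ascii") -> bytes:
    """Convert the internal text form to external bytes (the 'export' step)."""
    return text.encode(codec)


# The word HELLO as EBCDIC bytes (H=0xC8, E=0xC5, L=0xD3, O=0xD6).
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

internal = import_record(ebcdic_bytes)   # external EBCDIC -> internal text
transformed = internal.lower()           # stand-in for Transformer logic
ascii_bytes = export_record(transformed) # internal text -> external ASCII

print(internal, ascii_bytes)
```

The point of the sketch is that the intermediate step never sees EBCDIC or ASCII bytes; both conversions happen at the edges.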

If subsequent parallel jobs need to process the same data as the first parallel job in your example, which produces a sequential file, it is more efficient to use the Data Set stage between the multiple parallel jobs, because you would avoid the export and import overhead.

FranklinE · Post by **FranklinE** » Fri Dec 19, 2014 10:51 am

Thanks, Eric. Just to be sure I understand, please confirm or correct:

In my design/example, the output links from the FTP stage, and every link between there and the input links to the Sequential File stage, process the data using UTF-16.

qt_ky · Post by **qt_ky** » Fri Dec 19, 2014 10:54 am

Yes, that is correct. I learned about it in an IBM training class one time. I edited my post above to add that last paragraph too.

FranklinE · Post by **FranklinE** » Fri Dec 19, 2014 10:58 am

Again, thank you.

Our files are destined for use by another application. I'd very much like to use datasets, but they are not an option.

qt_ky · Post by **qt_ky** » Fri Dec 19, 2014 11:14 am

You're welcome. It is documented here:

http://www-01.ibm.com/support/knowledge ... _Sets.html

Also, now that I have read a bit more, I am wondering myself whether the UTF-16 statement may only apply to ustring data types (Unicode extended property), whereas string data types are 8-bit ASCII.

Perhaps UTF-16 encompasses and accommodates 8-bit ASCII. Need someone more expert to clarify...

Anyhow, what I relayed above is what I was taught in training.

ray.wurlod · Post by **ray.wurlod** » Fri Dec 19, 2014 12:54 pm

If NLS is enabled, parallel jobs use UTF-16 internally.

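UTF-16 shares code points 0 through 127 with ASCII. Code points 0 through 255 match ISO 8859-1 (Latin-1), which is what is usually meant by "extended", or 8-bit, ASCII; other 8-bit code pages, such as Windows-1252, differ in the 128 through 159 range.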
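The overlap in code points can be checked with a short Python sketch. It takes ISO 8859-1 (Latin-1) as the 8-bit extension, which is an assumption about which "extended ASCII" is meant, and the helper name `utf16_code_units` is made up for illustration.

```python
import struct


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text encoded as little-endian UTF-16."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


sample = "DataStage"

# For 7-bit characters, each UTF-16 code unit equals the ASCII byte value.
same_as_ascii = utf16_code_units(sample) == list(sample.encode("ascii"))

# For U+00E9 (e with acute accent), the code unit equals the Latin-1 byte value.
same_as_latin1 = utf16_code_units("\u00e9") == list("\u00e9".encode("latin-1"))

# The numeric values match, but the storage does not: two bytes per character.
utf16_size = len(sample.encode("utf-16-le"))
ascii_size = len(sample.encode("ascii"))

print(same_as_ascii, same_as_latin1, utf16_size, ascii_size)
```

So the values coincide, but the byte layout does not: the same text occupies twice as many bytes in UTF-16, which is why a conversion is still needed when the data is written out as ASCII.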

The story is different for server jobs.