Problem in converting UTF8 Character set to ASCII

dhletl
Participant
Posts: 22
Joined: Mon Aug 23, 2004 1:13 am

Problem in converting UTF8 Character set to ASCII

Post by dhletl »

Hi,

I am facing a problem with the UTF-8 character set.

I have an input file in the UTF-8 character set. When reading that file with a Sequential File stage, I define the NLS as UTF-8.
My job contains a Join stage and a Transformer stage.
Finally, I write a sequential file as output in ASCII mode.
The resulting file does appear to be ASCII (checked from Unix); however, it still contains a few junk characters.

Can you help me resolve this problem?

Thanks
Nitin
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

NLS is not really intended for translation of character sets, except from the various external coding schemes (such as UTF-8, GB2312, BIG5, SHIFT-JIS and so on) to and from DataStage's internal character set, which is an idiosyncratic encoding (called UV-UTF8) of Unicode code points; UV-UTF8 preserves dynamic array delimiter characters 0xF8 through 0xFF as single-byte representations.

Because ASCII (or ISO 8859, which is a superset of ASCII) is close to UTF-8, most of the characters survive what you are doing. Can you identify which characters are not being properly mapped, and what the actual "junk characters" are? Knowing this may help in diagnosing what's happening.
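One way to identify the characters that cannot be mapped is to decode the source data as UTF-8 and list every character above U+007F, since those have no ASCII equivalent. The following is only a sketch run outside DataStage; the function name and the sample text are made up for illustration.

```python
def unmappable_chars(utf8_bytes: bytes) -> list[tuple[int, str, str]]:
    """Return (character index, character, code point) for every
    character that has no ASCII representation."""
    text = utf8_bytes.decode("utf-8")
    return [
        (index, ch, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


# Hypothetical sample: "Zurich, cafe" with u-umlaut and e-acute.
sample = "Z\u00fcrich, caf\u00e9".encode("utf-8")
for index, ch, code_point in unmappable_chars(sample):
    print(index, ch, code_point)
```

Any character this reports would have to be replaced or dropped when the data is written as ASCII.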
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dhletl
Participant
Posts: 22
Joined: Mon Aug 23, 2004 1:13 am

Post by dhletl »

Ray,

The junk characters in the ASCII output file look like "^Z".

Essentially, I need to read a UTF-8 file as the source file in one of my jobs.
For all intermediate / temporary staging in the process I want to stick to the ASCII character set, and I need to generate the final file (after all processing) in the UTF-8 character set.
Any pointers on this?

Thanks and Regards,
Nitin
Eric
Participant
Posts: 254
Joined: Mon Sep 29, 2003 4:35 am

Post by Eric »

You need to find the hexadecimal or octal code of the junk character in the UTF-8 file. You can then determine whether it is an ASCII character or not.
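A rough sketch of that check, assuming the file is small enough to read into memory: it reports the offset and hex value of every byte that is not printable ASCII, tab, CR or LF. The function name and sample bytes are illustrative only.

```python
def suspicious_bytes(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, hex value) for every byte that is neither
    printable ASCII (0x20-0x7E) nor tab, LF or CR."""
    allowed_controls = {0x09, 0x0A, 0x0D}
    return [
        (offset, f"0x{byte:02X}")
        for offset, byte in enumerate(data)
        if byte not in allowed_controls and not 0x20 <= byte <= 0x7E
    ]


# Hypothetical sample: a Ctrl-Z byte and a two-byte UTF-8 sequence.
sample = b"abc\x1adef\xc3\xa9\n"
print(suspicious_bytes(sample))
```

For a real file, the bytes could be read with open(path, "rb").read(), where path stands for your own file. Values of 0x80 and above are not ASCII; values below 0x20 are ASCII control characters. On Unix, od -c gives a similar view without any scripting.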
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I think your "^Z" is probably a carriage return (CR). Usually when I see this on a UNIX box, it is because someone did a binary mode transfer of an ASCII file from Windows to UNIX. Line terminators on Windows are CRLF. On UNIX the line terminator is just a LF. Fix it by transferring the file in ASCII mode. In a server job, you could alternatively change properties to "DOS style" line terminators (don't know if this is an option for Parallel jobs though).

Mike
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Specifically, Ctrl-Z is the end-of-file marker in DOS. So this is a likely candidate if the data originally came from a Windows system.
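To test this theory, it helps to see where the Ctrl-Z bytes (0x1A, named SUB in ASCII) sit. A single one as the very last byte is consistent with a DOS end-of-file marker; occurrences scattered through the data point to something else. This is a sketch with made-up helper names, not anything DataStage provides.

```python
CTRL_Z = 0x1A  # ASCII SUB, displayed as ^Z by many Unix tools


def ctrl_z_offsets(data: bytes) -> list[int]:
    """Return the offset of every Ctrl-Z byte in the data."""
    return [offset for offset, byte in enumerate(data) if byte == CTRL_Z]


def strip_trailing_ctrl_z(data: bytes) -> bytes:
    """Drop one trailing Ctrl-Z, the position a DOS EOF marker occupies."""
    return data[:-1] if data.endswith(b"\x1a") else data


# Hypothetical sample: one Ctrl-Z inside the data, one at the end.
sample = b"row1\nro\x1aw2\n\x1a"
print(ctrl_z_offsets(sample))
print(strip_trailing_ctrl_z(sample))
```

If the offsets fall in the middle of fields rather than at the end of the file, one possibility (an assumption, not confirmed in this thread) is that the conversion to ASCII is writing 0x1A as a substitute for characters it cannot map.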

It's also a possibility that "UTF-8" on Windows and "UTF-8" on your UNIX aren't exactly the same; files labelled UTF-8 can differ in practice, for example in whether they begin with a byte order mark. You can learn more from the Unicode Consortium web site; search for "UTF-8".
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Thanks for the clarification Ray. I just realized that I confused the "^Z" with "^M" (which would appear at the end of every line if it was a line termination problem).
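For completeness, the line-terminator case can be ruled in or out by counting the terminators. This is a sketch only; the function names are invented for the example.

```python
def line_ending_counts(data: bytes) -> dict[str, int]:
    """Count Windows (CRLF), Unix (LF only) and bare CR line endings."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF only": data.count(b"\n") - crlf,
        "CR only": data.count(b"\r") - crlf,
    }


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF terminators to LF, leaving other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical sample with mixed terminators.
sample = b"a\r\nb\nc\r\n"
print(line_ending_counts(sample))
print(to_unix_line_endings(sample))
```

A non-zero CRLF count on a UNIX box suggests a binary mode transfer from Windows, which is what shows up as "^M" at the end of each line.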

Mike