Problem in converting UTF8 Character set to ASCII

dhletl · Post by **dhletl** » Thu Nov 18, 2004 10:40 pm

Hi,

I am facing a problem with the UTF-8 character set.

I have an input file in the UTF-8 character set. When reading that file with a sequential stage, I define the NLS as UTF-8.
The job I designed contains a join and a transformer stage.
Finally, I write the output to a sequential file in ASCII mode.
The resulting file does seem to be in ASCII mode (checked from UNIX); however, it still contains a few junk characters.

Can you help me resolve this problem?

Thanks
Nitin

ray.wurlod · Post by **ray.wurlod** » Fri Nov 19, 2004 1:12 am

NLS is not really intended for translation of character sets, except from the various external coding schemes (such as UTF-8, GB2312, BIG5, SHIFT-JIS and so on) to and from DataStage's internal character set, which is an idiosyncratic encoding (called UV-UTF8) of Unicode code points; UV-UTF8 preserves dynamic array delimiter characters 0xF8 through 0xFF as single-byte representations.

Because ASCII (or ISO 8859, which is a superset of ASCII) is close to UTF-8, most of the characters work with what you are doing. Can you identify which characters are not being properly mapped, and what the actual "junk characters" are? Knowing this may help in diagnosing what's happening.

dhletl · Post by **dhletl** » Fri Nov 19, 2004 1:52 am

Ray,

The junk characters coming out (in the ASCII file) look something like "^Z".

Essentially, I need to read a UTF-8 file as the source file in one of my jobs.
In the subsequent processing, I want to stick to the ASCII character set for all intermediate / temporary staging. And I need to generate a final file (after all processing) in the UTF-8 character set.
Any pointers on this?

Thanks and Regards,
Nitin

Eric · Post by **Eric** » Fri Nov 19, 2004 6:36 am

You need to find the hex or octal code of the junk character in the UTF-8 file. You can then establish whether it is an ASCII character or not.
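One way to do that check outside DataStage is a small script that lists every byte that is not printable ASCII, with its offset and hex code. This is only a sketch: the function name `find_suspect_bytes` and the sample data are made up for illustration, and it treats tab, LF and CR as acceptable control characters.

```python
def find_suspect_bytes(data: bytes):
    """Return (offset, byte value) pairs for bytes that are not printable ASCII."""
    allowed_controls = {0x09, 0x0A, 0x0D}  # tab, LF, CR are expected in text files
    suspects = []
    for offset, value in enumerate(data):
        is_control = value < 0x20 and value not in allowed_controls
        # 0x7F is DEL; anything above it is outside 7-bit ASCII
        if is_control or value >= 0x7F:
            suspects.append((offset, value))
    return suspects


if __name__ == "__main__":
    # Hypothetical sample: a Ctrl-Z (0x1A) and a two-byte UTF-8 sequence (0xC3 0xA9)
    sample = b"abc\x1adef\xc3\xa9\n"
    for offset, value in find_suspect_bytes(sample):
        print(f"offset {offset}: 0x{value:02X}")
```

A byte below 0x80 that shows up here (such as 0x1A) is a valid ASCII control character, not a mapping failure; bytes at 0x80 and above mean the file is not pure ASCII.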

Mike · Post by **Mike** » Fri Nov 19, 2004 7:21 am

I think your "^Z" is probably a carriage return (CR). Usually when I see this on a UNIX box, it is because someone did a binary mode transfer of an ASCII file from Windows to UNIX. Line terminators on Windows are CRLF. On UNIX the line terminator is just a LF. Fix it by transferring the file in ASCII mode. In a server job, you could alternatively change properties to "DOS style" line terminators (don't know if this is an option for Parallel jobs though).

Mike
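For reference, the CRLF versus LF difference described in this post can be checked and fixed after the transfer with a few lines of script. This is a rough sketch only; the helper names `count_line_endings` and `to_unix_line_endings` are invented for the example and are not part of DataStage.

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CRLF) and UNIX (bare LF) line terminators."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # every CRLF also contains one LF
    return {"crlf": crlf, "lf": lf_only}


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace each CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = b"row1\r\nrow2\nrow3\r\n"  # hypothetical mixed-terminator data
    print(count_line_endings(sample))
    print(to_unix_line_endings(sample))
```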

ray.wurlod · Post by **ray.wurlod** » Fri Nov 19, 2004 3:42 pm

Specifically, Ctrl-Z is the end-of-file marker in DOS. So this is a likely candidate if the data originally came from a Windows system.
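Ctrl-Z is byte 0x1A. If it turns out to be a leftover DOS end-of-file marker at the very end of the file, it can be dropped before the file reaches the job. A minimal sketch, assuming the marker only needs removing from the end of the data (the function name `strip_trailing_ctrl_z` is hypothetical):

```python
CTRL_Z = b"\x1a"  # ASCII 26, shown as ^Z by many UNIX tools


def strip_trailing_ctrl_z(data: bytes) -> bytes:
    """Remove Ctrl-Z bytes at the end of the data; interior bytes are left alone."""
    return data.rstrip(CTRL_Z)


if __name__ == "__main__":
    sample = b"row1\r\nrow2\r\n\x1a"  # hypothetical file content from a Windows source
    print(strip_trailing_ctrl_z(sample))
```

If 0x1A also appears in the middle of records, it is probably not just an end-of-file marker and needs to be investigated rather than stripped.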

It's also a possibility that "UTF-8" on Windows and "UTF-8" on your UNIX aren't exactly the same; there are quite a few UTF-8 encodings out there. You can learn about them from the Unicode Consortium web site; search for "UTF-8".

Mike · Post by **Mike** » Fri Nov 19, 2004 5:37 pm

Thanks for the clarification Ray. I just realized that I confused the "^Z" with "^M" (which would appear at the end of every line if it was a line termination problem).

Mike