Unicode - þ - Extended property

UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

I'll take a stab at it.

ASCII, let's start there. ASCII was originally 7 bits: an integer from 0 to 127. It was later extended, through 8-bit code pages such as ISO-8859-1, to use a full byte, 8 bits, 0-255.
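To make that concrete, here is a small Python sketch (my own illustration, not anything from DataStage) using the þ from the thread title. It assumes ISO-8859-1 (Latin-1) as the 8-bit extension; other code pages may put a different character at the same byte value.

```python
# Plain ASCII only covers code points 0..127.
assert all(ord(c) < 128 for c in "DataStage")

thorn = "\u00fe"  # þ, LATIN SMALL LETTER THORN, code point 254

# þ is outside the 7-bit range, so a strict ASCII encode fails.
try:
    thorn.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# In the 8-bit Latin-1 extension (assumed here) it fits in one byte, 0xFE.
latin1_bytes = thorn.encode("latin-1")

print(ord(thorn), ascii_ok, latin1_bytes.hex())
```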

This works great so long as everyone gets on board and learns English. Unfortunately, that is not the case in the real world... there are many languages that have more than 256 possible characters, and so on.

This is where someone dropped the ball. There was a brief time when someone with some sense could have made a fixed-length 4-byte character the norm and been solid DONE with the problem. A fixed-width form does exist (UTF-32), but hardly anyone stores data that way. Instead, the encodings you actually run into are variable length! In UTF-8, some characters are 1 byte, some 2 bytes, some 3, and some take 4 bytes. This is why Unicode is a royal pain to work with.

If you iterate over a UTF-8 string as bytes, the ASCII text comes through fine, and other language characters become muddled. Because the width varies, you can't treat any multiple of bytes as characters... you can't say that 30 bytes is 10 Unicode characters, as it could be anywhere from 8 to 30!
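A quick way to see the varying width for yourself is the Python sketch below. The sample characters are just ones I picked for illustration, and it assumes UTF-8, where a single character can take anywhere from 1 to 4 bytes.

```python
# One sample character for each UTF-8 width (chosen for illustration).
samples = {
    "a": "a",                   # plain ASCII letter
    "thorn": "\u00fe",          # þ
    "euro": "\u20ac",           # euro sign
    "emoji": "\U0001F600",      # grinning face
}

# Number of UTF-8 bytes each single character needs.
widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(widths)

text = "".join(samples.values())
char_count = len(text)                  # counts code points
byte_count = len(text.encode("utf-8"))  # counts bytes on the wire
print(char_count, byte_count)
```

So a field declared as N bytes and a field declared as N characters are not the same thing once the data leaves plain ASCII.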

So what is happening is this: if you treat a byte stream as ASCII (or an 8-bit code page) you get one answer, and if you treat it as Unicode you get a second answer. Most of the characters in your test look like ASCII (there are repeated symbols in Unicode, though!!) and so those are the same no matter how you look at them. The other values are converted from what they should be into gibberish.
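Here is a rough Python sketch of that "two answers" effect, again using þ. It is only an illustration under my own assumptions (UTF-8 on one side, Latin-1 on the other); your actual job may be using different maps.

```python
original = "a\u00feb"  # "aþb": ASCII on both sides of a thorn

# Written as UTF-8, then wrongly read back as Latin-1:
# the two UTF-8 bytes of þ (0xC3 0xBE) show up as two separate characters.
utf8_read_as_latin1 = original.encode("utf-8").decode("latin-1")

# Written as Latin-1, then wrongly read back as UTF-8:
# the lone byte 0xFE is not valid UTF-8, so it becomes U+FFFD.
latin1_read_as_utf8 = original.encode("latin-1").decode("utf-8", errors="replace")

print(repr(utf8_read_as_latin1))
print(repr(latin1_read_as_utf8))
```

In both directions the "a" and "b" survive untouched, which is why most of a test file can look fine while one delimiter or accented character turns to junk.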

This is WAY overly simplistic; you can look up Unicode on Wikipedia for the excruciating details and learn the "reasons" for varying-width characters and the historical issues. It's a mess, no matter how you spin it.
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

Ray, NLS is installed and enabled in my project. I can see the NLS tab in the job properties, but when I open it, I see Project Default (UTF-8). Is that something that was overwritten during installation?
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

UCDI, thanks for your time and explanation. I need to read it once more to understand this better.
Karthik
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Look like support questions to me. Or something for Google.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yeah, I think that last bit is just different names out there in the wild for very similar things - code point, code page, character set - see Character Encoding as one Google'd up example. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

UTF-8 is the default map for reading and writing.

UTF-16 is how DataStage parallel jobs store data internally.
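
As a generic illustration of why the map matters (this is plain Python, not DataStage code, and says nothing about the engine's actual internals), the same þ has different bytes in the two encodings. Big-endian UTF-16 is assumed here just to have a fixed byte order to show.

```python
thorn = "\u00fe"  # þ

utf8_bytes = thorn.encode("utf-8")       # external form under a UTF-8 map
utf16_bytes = thorn.encode("utf-16-be")  # UTF-16, big-endian, no BOM (assumed)

print(utf8_bytes.hex(), utf16_bytes.hex())

# Both byte sequences decode back to the same character
# as long as the right encoding is named on each side.
same_char = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == thorn
print(same_char)
```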
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.