Unicode - Ã¾ - Extended property

UCDI · Post by **UCDI** » Tue Dec 20, 2016 8:23 am

I'll take a stab at it.

ASCII, let's start there. ASCII was originally 7 bits: an integer value from 0 to 127. It was later extended to use a full byte, 8 bits, 0-255.

This works great so long as everyone gets on board and learns English. Unfortunately, that is not the case in the real world ... there are many languages that have more than 256 possible characters and so on.

This is where someone dropped the ball. There was a brief time when someone with some sense could have set up a 4 byte character and been solid DONE with the problem, using a fixed length field. (A fixed 4 byte encoding, UTF-32, does exist, but it is rarely used for files or data interchange.) Instead, the encoding that caught on, UTF-8, is variable length! Some of the characters are 1 byte, some 2 bytes, some 3 bytes, and some take 4 bytes. This is why Unicode is a royal pain to work with.

If you iterate over a UTF-8 string as bytes, the ASCII text comes through fine, and other languages' characters become muddled. Because the width varies, you can't treat any fixed multiple of bytes as characters... you can't say that 30 bytes is 10 Unicode characters, as it could be anywhere from 8 to 30!
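To make the varying width concrete, here is a minimal sketch in Python, assuming the data is encoded as UTF-8. The sample characters and the sample string are made up for illustration; they are not from the original job.

```python
# Assumption: text is encoded as UTF-8. Sample values are illustrative only.
samples = [
    "A",           # plain ASCII letter
    "\u00fe",      # þ (Latin small letter thorn)
    "\u20ac",      # € (euro sign)
    "\U0001F600",  # an emoji outside the Basic Multilingual Plane
]

# Number of bytes each character occupies once encoded as UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

text = "na\u00efve \u00feorn"           # "naïve þorn"
char_count = len(text)                  # counts code points
byte_count = len(text.encode("utf-8"))  # counts encoded bytes

print(widths)
print(char_count, byte_count)
```

The character count and the byte count only agree when every character is plain ASCII, which is why a byte length alone does not tell you how many characters you have.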

So what is happening is that if you treat a byte stream as ASCII you get one answer, and if you treat it as Unicode you get a second answer. Most of the characters in your test look like ASCII (there are repeated symbols in Unicode, though!!) and so those are the same no matter how you look at them. The other values are converted from what they should be into gibberish.

This is WAY overly simplistic; you can look up Unicode on Wikipedia for the excruciating details and learn the "reasons" for varying width characters and the historical issues. It's a mess, no matter how you spin it.
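A rough sketch of that effect in Python: the same bytes are decoded once with the correct character set and once with the wrong one. It assumes the data was written as UTF-8 and is misread as Latin-1 (ISO-8859-1); the sample string is hypothetical, and your job may be using a different single-byte code page.

```python
# Assumption: data was written as UTF-8 and misread as Latin-1.
original = "caf\u00e9 \u00fe"     # "café þ", a made-up sample value
raw = original.encode("utf-8")    # the bytes as stored in the file

right = raw.decode("utf-8")       # decoded with the correct character set
wrong = raw.decode("latin-1")     # each byte becomes its own character

print(right)
print(wrong)
print(len(right), len(wrong))
```

The ASCII letters come out identical either way. Each non-ASCII character was stored as two bytes, so the wrong decoding turns it into two unrelated characters, and the string gets longer.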

karthi_gana · Post by **karthi_gana** » Thu Dec 22, 2016 5:32 am

Ray, NLS is installed and enabled in my project. I can see the NLS tab in the job properties, but when I open it, I see Project Default (UTF-8). Is that something that was overwritten during installation?

karthi_gana · Post by **karthi_gana** » Thu Dec 22, 2016 5:35 am

UCDI, thanks for your time and explanation. I need to read it once again to understand more of this.

chulett · Post by **chulett** » Thu Dec 22, 2016 7:34 am

Look like support questions to me. Or something for Google.

chulett · Post by **chulett** » Thu Dec 22, 2016 8:57 am

Yeah, I think that last bit is just different names out there in the wild for very similar things - code point, code page, character set - see Character Encoding as one Google'd up example.

ray.wurlod · Post by **ray.wurlod** » Fri Dec 23, 2016 3:50 am

UTF-8 is the default map for reading and writing.

UTF-16 is how DataStage parallel jobs store data internally.
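As a loose illustration of the difference between an external map and an internal representation, the Python sketch below decodes UTF-8 bytes and re-encodes them as UTF-16. This is not DataStage's actual implementation; the little-endian byte order and the sample character are assumptions made for the example.

```python
# Illustration only: this mimics the idea of "map on the way in, map on the
# way out" and is not how DataStage is implemented.
external = b"\xc3\xbe"                 # þ as it sits in a UTF-8 file

text = external.decode("utf-8")        # apply the UTF-8 map when reading
internal = text.encode("utf-16-le")    # assumed little-endian UTF-16 form

# Writing applies the map in the other direction.
round_trip = internal.decode("utf-16-le").encode("utf-8")

print(external.hex(), internal.hex(), round_trip.hex())
```

The character is the same throughout; only its byte representation changes between the file and the in-memory form.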