I'll take a stab at it.
ASCII, let's start there. ASCII was originally 7 bits: an integer value from 0 to 127. It was later extended, in a number of incompatible ways (the various 8-bit code pages), to use a full byte: 8 bits, 0-255.
This works great so long as everyone gets on board and learns English. Unfortunately, that is not the case in the real world... there are many languages that have more than 256 possible characters, and so on.
This is where, in my view, someone dropped the ball. There was a brief time when someone with some sense could have set up a 4-byte character and been solid DONE with the problem, using a fixed-length field. (That fixed-width form does exist as UTF-32, but it is not what most data is stored in.) Instead, the encoding you usually run into, UTF-8, is variable length! Some characters are 1 byte, some 2, some 3, and some 4. This is why Unicode is a royal pain to work with.
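A rough sketch in Python comparing the two approaches, UTF-8 (variable width) against UTF-32 (fixed 4 bytes per character). The sample characters and the `widths` helper are just my picks for illustration, not anything from the original question.

```python
# Sample characters chosen for illustration: ASCII, Latin-1 range,
# a symbol from the Basic Multilingual Plane, and an emoji beyond it.
samples = ["A", "þ", "€", "😀"]

def widths(ch):
    """Return (UTF-8 byte count, UTF-32 byte count) for one character."""
    # "utf-32-le" is used so no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-le"))

for ch in samples:
    u8, u32 = widths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {u8} byte(s), UTF-32 = {u32} bytes")
```

The fixed-width form is simpler to index into, at the cost of 4 bytes for every character, plain ASCII included.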
If you iterate over a UTF-8 string as bytes, the ASCII text comes through fine, and characters from other languages become muddled. Because the width varies, you can't treat any fixed multiple of bytes as characters... you can't say that 30 bytes is 10 Unicode characters, as it could be anywhere from 8 to 30!
So what is happening is this: if you treat a byte stream as ASCII you get one answer, and if you treat it as Unicode you get a second answer. Most of the characters in your test look like ASCII (there are duplicate and look-alike symbols in Unicode, though!!) and so those are the same no matter how you look at them. The other values are converted from what they should be into gibberish.
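A minimal Python sketch of that "two answers" effect, using the þ from the thread title. I am assuming here that the bytes are UTF-8 and that the wrong reader treats them as Latin-1 (ISO-8859-1); your actual pair of character sets may differ.

```python
data = "þ".encode("utf-8")          # two bytes: 0xC3 0xBE

as_utf8 = data.decode("utf-8")      # read with the right encoding: one character
as_latin1 = data.decode("latin-1")  # read one byte at a time: two wrong characters

# Plain ASCII bytes come out the same under either interpretation.
plain = b"abc"
same_either_way = plain.decode("utf-8") == plain.decode("latin-1")

print(repr(as_utf8), repr(as_latin1), same_either_way)
```

The bytes never change; only the interpretation does, which is why the ASCII-looking part of a test file survives and everything else turns to junk.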
This is WAY oversimplified; you can look up Unicode on Wikipedia for the excruciating details and learn the "reasons" for varying-width characters and the historical issues. It's a mess, no matter how you spin it.
Unicode - þ - Extended property
-
- Premium Member
- Posts: 729
- Joined: Tue Apr 28, 2009 10:49 pm
Yeah, I think that last bit is just different names out there in the wild for closely related things (code point, code page, character set); see Character Encoding as one Googled-up example.
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia