Chinese Char /UTF-8

dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

Hi,

I'm trying to read a Chinese file. It fails when the columns are defined as CHAR, but works with VARCHAR.

The existing layouts for other regions use CHAR, and we are trying to minimize changes to the layout.

The layout is a fixed-width file with Char columns, Unicode.

FIRSTNAME:Char(30)-Unicode
MiddleName:Char(1)-Unicode
LastName:Char(30)-Unicode

Existing data:
COLNIE MPROLL
chinese data:
李娜 MPROLL

When viewed in a hex editor, each Chinese character took around 3 bytes.
Hence I tried FIRSTNAME as 6 bytes (data) + 24 pad characters, but no luck.

External ustring too short. Imported only 0 external characters into a ustring of fixed length 1.
##W IIS-DSEE-TFIG-00201 09:53:53(001) <SQ,0> Field "MiddleName" has import error and no default value; data: <empty>, at offset: 787
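
For illustration, a minimal Python sketch (outside DataStage) of why the 6-bytes-plus-24-pad attempt may come up short. It assumes, and this is only an assumption about the importer, that a Unicode Char(30) field is counted in characters rather than bytes. The helper names `pad_chars` and `pad_bytes` are hypothetical.

```python
def pad_chars(value: str, width: int) -> str:
    """Pad with spaces to a fixed number of characters (code points)."""
    return value.ljust(width)


def pad_bytes(value: str, width: int, encoding: str = "utf-8") -> bytes:
    """Pad with spaces so the encoded field occupies a fixed number of bytes."""
    raw = value.encode(encoding)
    if len(raw) > width:
        raise ValueError("value does not fit in the field")
    return raw + b" " * (width - len(raw))


name = "李娜"  # 2 characters, 3 bytes each in UTF-8

by_chars = pad_chars(name, 30)   # 2 data characters + 28 pad characters
by_bytes = pad_bytes(name, 30)   # 6 data bytes + 24 pad bytes

# Character count vs. byte count for each padding strategy
print(len(by_chars), len(by_chars.encode("utf-8")))
print(len(by_bytes.decode("utf-8")), len(by_bytes))
```

If the length is counted in characters, a field built as 6 data bytes plus 24 pad bytes holds only 26 characters, so a reader expecting 30 characters would run 4 characters into the following fields and the record would end early.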


Is only VarChar supported for multi-byte data?

Thanks in advance!
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Any reason you are not using UTF16?

Chinese is a double-byte character set.
dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

I'm looking at the various options for handling the Chinese/Thai data.

To get started, I selected UTF-8 as it handles multi-byte data.

1) Does that mean UTF-8 (Unicode) will not handle double-byte data?
2) And does it always have to be variable length?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

UTF-8 is an encoding of the Unicode code points, and does handle multi-byte data (though using up to four bytes per character).
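
As a quick check of the byte widths, here is a small Python sketch comparing how many bytes a single character takes in UTF-8 and UTF-16. The helper name `encoded_widths` is hypothetical, and `utf-16-le` is used only so that no byte order mark is counted.

```python
def encoded_widths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "A": "Latin letter",
    "é": "accented Latin letter",
    "李": "Chinese",
    "ก": "Thai",
    "😀": "emoji outside the Basic Multilingual Plane",
}

for ch, label in samples.items():
    utf8_len, utf16_len = encoded_widths(ch)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Common Chinese and Thai characters take three bytes in UTF-8 and two in UTF-16; characters outside the Basic Multilingual Plane take four bytes in both.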

VarChar will give you fewer problems than Char, because the latter requires fields to be padded to length.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abyss
Premium Member
Posts: 172
Joined: Thu May 22, 2014 12:43 am

Post by abyss »

Not sure about Thai, but always use UTF-16 for Chinese, Japanese and Korean characters.
pjedson
Participant
Posts: 2
Joined: Wed Sep 28, 2016 3:26 pm
Location: GSO,NC

Post by pjedson »

Handling NLS data is a bit tricky.
Troubleshooting depends on the database and OS involved.

If ODBC stages are used, check the following:
- Check the IANAAppCodePage value in odbc.ini.
- Use wide character types wherever possible.

Hope this helps.
dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

Thanks for your replies.

Are there any other issues, apart from byte space, between UTF-8 and UTF-16?

1) We were able to use UTF-8 for Thai and Chinese, with sequential files as both input and output.

2) Is there a way to check this in DataStage? Right now I don't have a temp DB to compare byte space usage between UTF-8 and UTF-16.

3) Is it possible to read mainframe input in UTF-16, process it in DataStage, and load it into MDM tables (UTF-8)?

Thanks.
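
On questions 2 and 3, the byte usage can be compared without a database, and the UTF-16 to UTF-8 conversion itself is lossless. The Python sketch below only illustrates the idea outside DataStage. The `transcode` helper is hypothetical, and big-endian UTF-16 for the mainframe input is an assumption, not something confirmed in this thread.

```python
def transcode(raw: bytes, src: str = "utf-16-be", dst: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return raw.decode(src).encode(dst)


sample = "李娜 MPROLL"  # sample data from the first post

# Stand-in for UTF-16 input (assumed big-endian, no byte order mark)
utf16_bytes = sample.encode("utf-16-be")
utf8_bytes = transcode(utf16_bytes)

print("UTF-16 bytes:", len(utf16_bytes))
print("UTF-8 bytes: ", len(utf8_bytes))

# The round trip preserves every character
assert utf8_bytes.decode("utf-8") == sample
```

Which encoding is smaller depends on the data: text that is mostly ASCII is smaller in UTF-8, while text that is almost entirely Chinese is smaller in UTF-16 (two bytes per character instead of three).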