Handling Multi-lingual data

holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Handling Multi-lingual data

Post by holygrail »

I do ETL-type transformations on data from legacy systems in different countries and feed it to a central database. Now, with data from Western European and Asian countries, I may need to enable multi-lingual handling capabilities. How can this be done?

What do I have to do at the level of reading multi-lingual data in DSPX jobs, processing it, and loading it into the final DB?

TIA
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi and Welcome aboard :),
Do you have data in different languages in the same column across different rows?
In that case you need an NLS character set that supports all the languages involved.

I think you need the DB to support all the character sets you need anyway.

If you meant different languages in different columns of the same row, DS has per-column NLS, but what about your DB?

By the way, NLS degrades performance, and per-column NLS even more so.

I guess your main concern is the DB, since you can build custom NLS maps for DS yourself.

Ray posted a white paper or article regarding NLS (it is on DataStageXchange; get there from the link on the home page of this site).

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Hello Grail,

you do not necessarily need to activate NLS in DataStage in order to process multiple character sets, including single- and multi-byte data. If, on the other hand, you need to manipulate columns containing multibyte information, and/or you need to collate or change files or columns that mix code sets, then you will be best served by using NLS.
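
A rough illustration of that point, as a minimal Python sketch (not DataStage code; the Japanese field value is just an invented example): a byte-oriented substring can cut a UTF-8 character in half and produce junk, which is exactly the kind of manipulation where character-set awareness matters.

    # Byte-wise vs. character-wise substring on multibyte data.
    row = "注文番号 12345".encode("utf-8")      # hypothetical Japanese field value

    first_four_bytes = row[:4]                                  # non-NLS view: 4 *bytes*
    print(first_four_bytes.decode("utf-8", errors="replace"))   # '注�' - character cut in half

    first_four_chars = row.decode("utf-8")[:4]                  # character-aware view: 4 *characters*
    print(first_four_chars)                                     # '注文番号' - intact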

Note that going to NLS is quite easy, but it is a lot of work to go back to a non-NLS implementation, so it is something that you really need to think about.
holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Post by holygrail »

Thanks again, Roy and Arnd.

Arnd,
I do not think it will be necessary to go back to non-NLS from an NLS setup. However, I would like some details on why going back can be tough.

Roy,
I have already researched and worked with per-column NLS settings as well as the project-level setting. Can you provide some details on how much performance might be affected?

All,
If my DB is NLS-enabled and I use a project-level NLS setting:

1. Will I have to increase my field sizes where non-ASCII characters are expected? E.g., if a field is varchar(50) and I expect Chinese characters, do I have to allot at least 150 bytes for it?

2. Also, if I cannot predict which fields might require the above change, then I might have to do this for all fields. How much will this affect job performance and server load?

3. At the DB level, how does one enable NLS settings? How does it affect the existing data in the DB?

4. Will any other changes be involved?

TIA
Grail
goma
Participant
Posts: 10
Joined: Thu Mar 31, 2005 12:03 am

Post by goma »

1. Will I have to increase my field sizes where non-ASCII characters are expected? E.g., if a field is varchar(50) and I expect Chinese characters, do I have to allot at least 150 bytes for it?

I think it depends on the DB implementation; some DBs specify length in bytes, others in characters. You need to consult the manual of the DB you want to use.
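
As a rough sketch of the byte arithmetic (plain Python, nothing DataStage-specific; the sample string is invented): typical Chinese characters take 3 bytes each in UTF-8, so a 50-character value can need 150 bytes (up to 200 for 4-byte code points) wherever the column length is counted in bytes rather than characters, e.g. Oracle's BYTE versus CHAR length semantics.

    # 50 Chinese characters: 50 when counted as characters, 150 as UTF-8 bytes.
    value = "数据仓库" * 12 + "仓库"            # hypothetical 50-character value
    print(len(value))                           # 50 characters
    print(len(value.encode("utf-8")))           # 150 bytes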

2. Also, if I cannot predict which fields might require the above change, then I might have to do this for all fields. How much will this affect job performance and server load?
It depends on the volume of data and whether NLS conversion is necessary, but in the past I observed a particular job take two or three times longer.

3. At the DB level, how does one enable NLS settings? How does it affect the existing data in the DB?
I think once you specify the DB NLS, it is not possible to change the NLS setting later. Generally speaking, you need to set NLS=UTF8 or the appropriate code page for a Unicode setting. But once again, it depends on the DB you want to use.

I am just wondering: what is the code page (CP) of the source data? Does it have a particular CP, or is it UTF-8? In the former case, how do you let DataStage know which CP it is?
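
For illustration only (Shift-JIS is just an assumed example here, not necessarily your source encoding), this small Python sketch shows why the code page matters: the raw bytes do not say which code page they are in, so whatever reads them has to be told before converting to UTF-8.

    # The same bytes interpreted with the right and the wrong code page.
    source_bytes = "東京".encode("shift_jis")                   # pretend legacy extract
    print(source_bytes.decode("shift_jis").encode("utf-8"))     # correct round-trip to UTF-8
    print(source_bytes.decode("utf-8", errors="replace"))       # wrong guess: junk characters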
holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Post by holygrail »

Goma,

The source data will be UTF-8. So I think that is cleared up.
goma
Participant
Posts: 10
Joined: Thu Mar 31, 2005 12:03 am

Post by goma »

What exactly do you want to do with the data? Do you want to change the code page? Do you want to substring/concatenate data in some fields?

If what you want to do is just a transform, say Field A in the source to Field B in the target, and your target database is UTF-8, a non-NLS setting seems like it would work, since a non-NLS setting does not convert any data.
holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Post by holygrail »

No Goma, I have to do more than just direct mappings. The transformations might cover all sorts: substring, concatenate, et al.

If the input data is UTF-8 encoded, the final DB is also UTF-8 enabled, and all the necessary columns (if not all) have UTF-8 as their extended character set, then why would it matter what sort of transformations one meddles in :?:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Holygrail,

it sounds like you have already gone with NLS, so going back is a moot point. The primary reason to revert from NLS to a non-NLS implementation is performance. Each and every string operation becomes much more CPU-consuming with a non-single-byte character set. For instance, the operation MyString[500,1] at machine level will do a 500-byte offset from the beginning of the string and retrieve 1 byte. If you have NLS enabled, there is a good chance that you have single-, double- and up to 4-byte characters; thus you cannot do a simple offset operation. You have to traverse the string from the beginning, counting the lead byte of each character until you have reached a count of 500, and then take the next character (which may again be 1 to 4 bytes). Unlike in a non-NLS string system, these operations are no longer simple fixed offsets.
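
A minimal Python sketch of that traversal (plain Python, not DataStage BASIC; the function is only illustrative): finding the 500th character of a UTF-8 string means counting lead bytes from the start rather than jumping a fixed offset.

    # Character-based substring over UTF-8 bytes: scan, don't jump.
    def char_substring_utf8(data: bytes, start: int, length: int) -> bytes:
        """Return `length` characters beginning at 1-based character `start`."""
        starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]  # lead bytes
        starts.append(len(data))                                        # end-of-string sentinel
        return data[starts[start - 1]:starts[min(start - 1 + length, len(starts) - 1)]]

    text = ("x" * 499 + "五" + "y" * 10).encode("utf-8")
    print(text[499:500])                                        # byte offset: splits the 3-byte character
    print(char_substring_utf8(text, 500, 1).decode("utf-8"))    # walks the string: prints the character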

Anyway, when it comes to the Oracle DB and its NLS implementation, there are a number of quirks that need to be accounted for, e.g. defining a CHAR(5) field and having its metadata imported as CHAR(20) [this is an effect of the session NLS settings and whether BYTE or CHAR is used as the default length semantics].

A good starting point for the database NLS settings is Character Set Encoding
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Another excellent reference is the DataStage NLS manual. However, it does not fully explain the idiosyncratic encoding of Unicode used within DataStage when NLS is enabled. Then again, I guess "they" reckon you don't really need to know.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Post by holygrail »

I used a UTF-8 encoded file (with Japanese and English characters) for processing. Direct mapping using a transformer works fine, but in sort, join, or other collating activities the encoding gets lost and the output shows junk characters.
I used a JP NLS locale, but the Director showed the following warning:

main_program: No collation sequence found for ja_JP, falling back on locale ja

The result has lost its encoding. How do I handle this? Should I modify the dsenv file?