Handling Multi-lingual data

holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Handling Multi-lingual data

Post by holygrail »

I do ETL-type transformations on data from legacy systems in different countries and feed it to a central database. Now, with data from Western European and Asian countries, I may need to enable multi-lingual handling capabilities. How can this be done?

What do I have to do at the level of reading multi-lingual data in DSPX jobs, processing it, and loading it into the final DB?

TIA
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi and Welcome aboard :),
Do you have data in different languages in the same column across different rows?
In that case you need an NLS character set that supports all the languages involved.

I think you need the DB to support all the character sets you need anyway.

If you meant different languages in different columns of the same row, DS has per-column NLS, but what about your DB?

By the way, NLS degrades performance, and per-column NLS even more so.

I guess your main concern is the DB, since you can build custom NLS maps for DS yourself.

Ray posted a white paper or article regarding NLS (it is on DataStageXchange; get there from the link on the home page of this site).

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Hello Grail,

you do not necessarily need to activate NLS in DataStage in order to process multiple character sets, including single- and multi-byte data. If, on the other hand, you need to manipulate columns containing multibyte information, and/or you need to collate or change files or columns that mix code sets, then you will be best served by using NLS.
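
A rough illustration of that point, as a minimal Python sketch (not DataStage code; the Japanese field value is just an invented example): a byte-oriented substring can cut a UTF-8 character in half and produce junk, which is exactly the kind of manipulation where character-set awareness matters.

    # Byte-wise vs. character-wise substring on multibyte data.
    row = "注文番号 12345".encode("utf-8")      # hypothetical Japanese field value

    first_four_bytes = row[:4]                                  # non-NLS view: 4 *bytes*
    print(first_four_bytes.decode("utf-8", errors="replace"))   # '注�' - character cut in half

    first_four_chars = row.decode("utf-8")[:4]                  # character-aware view: 4 *characters*
    print(first_four_chars)                                     # '注文番号' - intact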

Note that going to NLS is quite easy, but it is a lot of work to go back to a non-NLS implementation, so it is something that you really need to think about.
holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Post by holygrail »

Thanks again, Roy and Arnd.

Arnd,
I do not think it will be necessary to go back to non-NLS from an NLS setup. However, I would like some details on why going back can be tough.

Roy,
I have already researched and worked with per-column NLS settings as well as the project-level setting. Can you provide some details on how much performance might be affected?

All,
If my DB is NLS-enabled and I use a project-level NLS setting:

1. Will I have to increase my field sizes where non-ASCII characters are expected? E.g., if a field is varchar(50) and I expect Chinese characters, do I have to allot at least 150 bytes for it?

2. Also, if I cannot predict which fields might require the above change, then I might have to do this for all fields. How much will this affect job performance and server load?

3. At the DB level, how does one enable NLS settings? How does it affect the existing data in the DB?

4. Will any other changes be involved?

TIA
Grail
goma
Participant
Posts: 10
Joined: Thu Mar 31, 2005 12:03 am

Post by goma »

1. Will I have to increase my field sizes where non-ASCII characters are expected? E.g., if a field is varchar(50) and I expect Chinese characters, do I have to allot at least 150 bytes for it?

I think it depends on the DB implementation; some DBs specify length in bytes, others in characters. You need to consult the manual of the DB you want to use.
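
As a rough sketch of the byte arithmetic (plain Python, nothing DataStage-specific; the sample string is invented): typical Chinese characters take 3 bytes each in UTF-8, so a 50-character value can need 150 bytes (up to 200 for 4-byte code points) wherever the column length is counted in bytes rather than characters, e.g. Oracle's BYTE versus CHAR length semantics.

    # 50 Chinese characters: 50 when counted as characters, 150 as UTF-8 bytes.
    value = "数据仓库" * 12 + "仓库"            # hypothetical 50-character value
    print(len(value))                           # 50 characters
    print(len(value.encode("utf-8")))           # 150 bytes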

2. Also, if I cannot predict which fields might require the above change, then I might have to do this for all fields. How much will this affect job performance and server load?
It depends on the volume of data and whether NLS conversion is necessary, but in the past I observed a particular job take two or three times longer.

3. At the DB level, how does one enable NLS settings? How does it affect the existing data in the DB?
I think once you specify the DB NLS, it is not possible to change the NLS setting later. Generally speaking, you need to set NLS=UTF8 or the appropriate code page for a Unicode setting. But once again, it depends on the DB you want to use.

I am just wondering: what is the code page (CP) of the source data? Does it have a particular CP, or is it UTF-8? In the former case, how do you let DataStage know which CP it is?
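
For illustration only (Shift-JIS is just an assumed example here, not necessarily your source encoding), this small Python sketch shows why the code page matters: the raw bytes do not say which code page they are in, so whatever reads them has to be told before converting to UTF-8.

    # The same bytes interpreted with the right and the wrong code page.
    source_bytes = "東京".encode("shift_jis")                   # pretend legacy extract
    print(source_bytes.decode("shift_jis").encode("utf-8"))     # correct round-trip to UTF-8
    print(source_bytes.decode("utf-8", errors="replace"))       # wrong guess: junk characters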
holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Post by holygrail »

Goma,

The source data will be UTF-8. So I think that is cleared up.
goma
Participant
Posts: 10
Joined: Thu Mar 31, 2005 12:03 am

Post by goma »

What exactly do you want to do with the data? Do you want to change the code page? Do you want to substring/concatenate data in some fields?

If what you want to do is just a transform, say Field A in the source to Field B in the target, and your target database is UTF-8, a non-NLS setting seems like it would work, since a non-NLS setting does not convert any data.
holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Post by holygrail »

No Goma, I have to do more than just direct mappings. The transformations might cover all sorts: substring, concatenate, et al.

If the input data is UTF-8 encoded, the final DB is also UTF-8 enabled, and all the necessary columns (if not all) have UTF-8 as their extended character set, then why would it matter what sort of transformations one meddles in :?:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Holygrail,

it sounds like you have already gone with NLS, so going back is a moot point. The primary reason to revert from NLS to a non-NLS implementation is performance. Each and every string operation becomes much more CPU-consuming with a non-single-byte character set. For instance, the operation MyString[500,1] at machine level will do a 500-byte offset from the beginning of the string and retrieve 1 byte. If you have NLS enabled, there is a good chance that you have single-, double- and up to 4-byte characters; thus you cannot do a simple offset operation. You have to traverse the string from the beginning, counting the lead byte of each character until you have reached a count of 500, and then take the next character (which may again be 1 to 4 bytes). Unlike in a non-NLS string system, these operations are no longer simple fixed offsets.
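
A minimal Python sketch of that traversal (plain Python, not DataStage BASIC; the function is only illustrative): finding the 500th character of a UTF-8 string means counting lead bytes from the start rather than jumping a fixed offset.

    # Character-based substring over UTF-8 bytes: scan, don't jump.
    def char_substring_utf8(data: bytes, start: int, length: int) -> bytes:
        """Return `length` characters beginning at 1-based character `start`."""
        starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]  # lead bytes
        starts.append(len(data))                                        # end-of-string sentinel
        return data[starts[start - 1]:starts[min(start - 1 + length, len(starts) - 1)]]

    text = ("x" * 499 + "五" + "y" * 10).encode("utf-8")
    print(text[499:500])                                        # byte offset: splits the 3-byte character
    print(char_substring_utf8(text, 500, 1).decode("utf-8"))    # walks the string: prints the character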

Anyway, when it comes to the Oracle DB and its NLS implementation, there are a number of quirks that need to be accounted for, e.g. defining a CHAR(5) field and having its metadata imported as CHAR(20) [this is an effect of the session NLS settings and whether BYTE or CHAR is used as the default length semantics].

A good starting point for the database NLS settings is Character Set Encoding
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Another excellent reference is the DataStage NLS manual. However, it does not fully explain the idiosyncratic encoding of Unicode used within DataStage when NLS is enabled. Then again, I guess "they" reckon you don't really need to know.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
holygrail
Participant
Posts: 9
Joined: Wed Nov 10, 2004 4:14 am

Post by holygrail »

I used a UTF-8 encoded file (with Japanese and English characters) for processing. Direct mapping using a transformer works fine, but in sort, join, or other collating activities the encoding gets lost and the output shows junk characters.
I used a JP NLS locale, but the Director showed the following warning:

main_program: No collation sequence found for ja_JP, falling back on locale ja

The result has lost its encoding. How do I handle this? Should I modify the dsenv file?