Recognizing junk characters

skp · Post by **skp** » Thu Jan 03, 2013 3:02 am

Issue: I have a column CustomerName varchar(50) where the data looks like this:

MHI三原　ホストダウンサイジング　棚卸フェーズ

I want to know whether this data consists of Japanese/Chinese characters or junk characters.

Can anyone guide me to the correct stage or rules to define in QualityStage that will classify the data as Chinese, Japanese or junk?

Can anyone please advise on this?

ray.wurlod · Post by **ray.wurlod** » Thu Jan 03, 2013 4:27 am

These are Japanese characters.

There are no rules in QualityStage for dividing characters based on the character sets to which each belongs. That's not really what QualityStage is for, although you might be able to create a heavily customised rule set.

The preferred tool would be DataStage, and you'd still need some custom code to identify whether a particular character belongs to a particular character set (aka code page). But why?

Know also that Chinese, Japanese and Korean share thousands of characters (the CJK Unified Ideographs in the Unicode standard).
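As a rough illustration of the kind of custom code mentioned above, here is a minimal Python sketch that labels each character by Unicode block using hard-coded code point ranges. The names `classify_char`, `summarize` and `guess_language`, and the particular ranges chosen, are assumptions made for this example; none of it is a DataStage or QualityStage feature. Han characters are reported as a single group because code points alone cannot tell Chinese from Japanese.

```python
from collections import Counter

# Inclusive code point ranges for a few Unicode blocks.
# Illustrative only: the list is far from exhaustive.
RANGES = [
    (0x0020, 0x007E, "ascii"),
    (0x3000, 0x303F, "cjk_punct"),   # includes the ideographic space U+3000
    (0x3040, 0x309F, "hiragana"),
    (0x30A0, 0x30FF, "katakana"),
    (0x4E00, 0x9FFF, "han"),         # CJK Unified Ideographs (base block)
    (0xAC00, 0xD7A3, "hangul"),
    (0xFF00, 0xFFEF, "halfwidth_fullwidth"),
]

def classify_char(ch):
    """Return a block label for one character."""
    cp = ord(ch)
    # Control characters and U+FFFD (replacement character) often
    # indicate a decoding problem, so flag them for review.
    if cp < 0x20 or cp == 0x7F or cp == 0xFFFD:
        return "suspect"
    for low, high, label in RANGES:
        if low <= cp <= high:
            return label
    return "other"

def summarize(text):
    """Count the characters of a string per block label."""
    return Counter(classify_char(ch) for ch in text)

def guess_language(text):
    """Very rough guess based only on which blocks occur."""
    counts = summarize(text)
    if counts["suspect"]:
        return "suspect"
    if counts["hiragana"] or counts["katakana"]:
        return "japanese"            # kana is specific to Japanese
    if counts["hangul"]:
        return "korean"
    if counts["han"]:
        return "chinese-or-japanese" # ambiguous without more context
    return "other"

sample = "MHI三原\u3000ホストダウンサイジング\u3000棚卸フェーズ"
print(summarize(sample))
print(guess_language(sample))  # japanese
```

The sample value from the first post contains katakana, which is why the sketch reports it as Japanese. A value flagged as "suspect" should still be reviewed with the data owners rather than discarded.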

ArndW · Post by **ArndW** » Thu Jan 03, 2013 5:05 am

That is standard Japanese text, discussing host downsizing.

ray.wurlod · Post by **ray.wurlod** » Thu Jan 03, 2013 1:53 pm

In my experience it is totally undesirable to dismiss any character as "junk" without consultation with the owners of the data.

skp · Post by **skp** » Fri Jan 04, 2013 12:06 am

Actually, I need to convert the given characters to English.

As this data arrives on a daily basis, I want to work out a process in DataStage that will convert Japanese characters to English characters.

Can this process be implemented in DataStage itself?

Thanks

ray.wurlod · Post by **ray.wurlod** » Fri Jan 04, 2013 2:36 am

Yes, but not meaningfully. For example you can transliterate (specify the sound of a Japanese character using English characters, such as "東" and "京" becoming "Tō" and "kyō") but there is no one-to-one correspondence between CJK characters and English characters. Anyone who specifies such a requirement is ignorant of the differences. Resist stupid requirements!
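To make the difference between transliteration and translation concrete, here is a small Python sketch. The table `KATAKANA_ROMAJI` and the function `romanize` are hypothetical names for this example, and the table is deliberately tiny: it covers only the katakana needed for one word from the sample data and ignores long vowel marks, small kana and every kanji.

```python
# Tiny, deliberately incomplete katakana-to-romaji table (Hepburn-style).
# A real transliterator needs a full kana table plus a dictionary for
# kanji, whose readings depend on context.
KATAKANA_ROMAJI = {
    "ホ": "ho", "ス": "su", "ト": "to", "ダ": "da", "ウ": "u",
    "ン": "n", "サ": "sa", "イ": "i", "ジ": "ji", "グ": "gu",
}

def romanize(text):
    """Replace known katakana with romaji; leave all other characters as-is."""
    return "".join(KATAKANA_ROMAJI.get(ch, ch) for ch in text)

print(romanize("ホストダウンサイジング"))  # hosutodaunsaijingu
print(romanize("三原"))                    # 三原 (kanji are left untouched)
```

The output spells the sound of the katakana in Latin letters. It is not an English translation, and the kanji pass through unchanged because a lookup table per character cannot decide how they should be read.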

rjdickson · Post by **rjdickson** » Sat Jan 05, 2013 5:18 pm

Ray is spot on. QualityStage and DataStage are not translation engines. You may want to look at translation engines such as Google's.

sendmkpk · Post by **sendmkpk** » Fri May 10, 2013 3:19 am

Hi

I just had an idea: can't we use a web service and connect to Google to get the translated response back?

Would it work?

Regards

ray.wurlod · Post by **ray.wurlod** » Fri May 10, 2013 3:25 am

Quite possibly; translation engines are getting better. But it would not be DataStage/QualityStage doing the work, which was the gist of the original post.