How to generate file in UTF-8 format

jpraveen
Participant
Posts: 71
Joined: Sat Jun 06, 2009 7:10 am
Location: HYD

Post by jpraveen »

Hi

I am generating a fixed-width flat file, and I need this file in UTF-8 format.

I changed the NLS map to UTF-8 in the Sequential File stage and also in the job properties.

But when I check the file on the Unix box, it shows as us-ascii.

I used the command below to check the file format on Unix:

file -bi <FF1>

Output:
text/plain; charset=us-ascii

Can you let me know how to generate a file in UTF-8 format?
Jaypee
vinothkumar
Participant
Posts: 342
Joined: Tue Nov 04, 2008 10:38 am
Location: Chennai, India

Post by vinothkumar »

You can generate the file in ASCII and convert it to UTF-8 using the iconv command on Unix.

iconv -f ascii -t utf-8 f1.txt -o f1.utf8.txt
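
As a rough illustration of what such a conversion does, here is a minimal Python sketch of the same transcoding step. The helper name `transcode` and the sample records are hypothetical. For input that contains only ASCII characters, the output bytes come out identical to the input, so the result of the `file` check would not be expected to change.

```python
def transcode(data: bytes, source: str = "ascii", target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


# Hypothetical fixed-width record containing only ASCII characters.
ascii_record = b"0001JOHN SMITH    "
converted = transcode(ascii_record)
print(converted == ascii_record)  # the bytes are unchanged

# A Latin-1 record with a non-ASCII character does change when transcoded:
# U+00FC (u with diaeresis) is one byte in Latin-1 and two bytes in UTF-8.
latin1_record = "M\u00fcller".encode("latin-1")
print(len(latin1_record), len(transcode(latin1_record, "latin-1", "utf-8")))
```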
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Does the file you are checking actually contain any characters that don't map to the single-byte character set? Otherwise you will always get this value from the "file" command.
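
To illustrate the point, here is a simplified sketch of a content-based charset guess. It is not the actual implementation of the `file` command; the function name `guess_charset` and the `unknown-8bit` label are assumptions made for this example. The idea is that a detector that only sees bytes has nothing to distinguish UTF-8 from ASCII until a byte above 127 appears.

```python
def guess_charset(data: bytes) -> str:
    """Guess a charset label from the raw bytes alone (simplified heuristic)."""
    if data.isascii():
        # Every byte is below 128, so the data is valid ASCII and valid UTF-8.
        return "us-ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-8bit"
    return "utf-8"


# Hypothetical fixed-width records.
print(guess_charset(b"0001JOHN SMITH    "))
# "\u00dc" is U with diaeresis, which needs two bytes in UTF-8.
print(guess_charset("0002J\u00dcRGEN M\u00dcLLER ".encode("utf-8")))
```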
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

Correct me if I am wrong, but I thought UTF-8 is "one of several" extended ASCII sets, that is, bytes 0-127 are "ASCII" and 128-255 are mapped for "non-English" characters.

If you don't use any characters over 127, I am not sure that any tool can tell the difference between them, assuming we are talking about a pure text file without markup, extensions, or some other way to differentiate.

Again, I could be wrong, so I am half asking here...
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

UTF-8 is a Unicode encoding in which each character is encoded in 1 to 4 bytes. ASCII characters are encoded in UTF-8 exactly as they are in ASCII.

So us-ascii is essentially a subset of UTF-8.

Your us-ascii file is also a UTF-8 file.

Mike
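
The 1-to-4-byte claim can be checked with a short Python sketch. The helper name `utf8_lengths` and the sample characters are chosen only for illustration.

```python
from typing import List


def utf8_lengths(text: str) -> List[int]:
    """Return the number of bytes each character occupies when encoded as UTF-8."""
    return [len(ch.encode("utf-8")) for ch in text]


# 'A' (U+0041), e with acute (U+00E9), euro sign (U+20AC), an emoji (U+1F600)
sample = "A\u00e9\u20ac\U0001F600"
print(utf8_lengths(sample))

# ASCII text produces the same bytes under both encodings.
ascii_text = "0001JOHN SMITH"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))
```

For a fixed-width file this also means that character count and byte count only diverge once non-ASCII characters appear.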
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yes, but isn't there some sort of magic (maybe 4-byte) header on UTF-8 files? I recently had an issue where a particular set of files would come in either format. My tool, when set to UTF-8, could read either without issue, but when set to US-ASCII it would barf on a UTF-8 file, adding some "garbage" characters to the first field.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

It seems a UTF-8 file can come with an optional 3-byte BOM (byte order mark, the bytes EF BB BF).

But that is no guarantee that it is a UTF-8 file.

I think if you're expecting a UTF-8 file, getting a us-ascii file should be no problem.

If you're expecting a us-ascii file, getting a UTF-8 file with a BOM is going to be a problem even if everything after the BOM is ASCII.

Mike
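
A minimal Python sketch of the BOM behaviour described above, assuming a hypothetical record `ID001` and a hypothetical helper `strip_utf8_bom`. It shows where "garbage" in the first field can come from when a reader does not expect the BOM, and one way to drop it.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present; otherwise return data as is."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"ID001"

# A single-byte reader such as Latin-1 turns the BOM into three stray
# characters (U+00EF, U+00BB, U+00BF) in front of the first field.
garbled = with_bom.decode("latin-1")
print(ascii(garbled))

# A strict ASCII reader rejects the BOM bytes outright.
try:
    with_bom.decode("ascii")
except UnicodeDecodeError as exc:
    print("ASCII decode failed:", exc.reason)

# Python's "utf-8-sig" codec skips the BOM when decoding.
print(with_bom.decode("utf-8-sig"))
print(strip_utf8_bom(with_bom))
```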
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That mirrors my experience.
-craig

"You can never have too many knives" -- Logan Nine Fingers