Page 1 of 1

Remove multiple Ascii characters from input data

Posted: Wed Sep 06, 2017 12:21 pm
by Vrisha
Hi,

I have a requirement to remove multiple ascii characters from the input data.
Eg- input data has below values

column name - Description
==================
Paratransit¿Web-Based¿Trip¿Boo
TRE DFW ROW Maintenance Tie Re

Input data length - char (30)
Output data length - Varchar(30)

I am getting an error 'A data conversion error was encountered in bulk load mode for row '


1)Transformation rule - Convert(char(160):char(168):char(194),'',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION)))- error as above

2) CONVERT(CHAR(194), "", CONVERT(CHAR(191), "", Convert(char(160),'',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION))))))-- error as above

3) When I changed the datalength to 40 everything got loaded.But not allowed to do so.

4)I changed the bulk load to write mode as 'Insert' and table action to
'Create' in target oracle stage and getting error of 'Error code: 12899, Error message: ORA-12899: value too large for column "MSPO_TEST"."DESC_PROJECTS2"."DESCRIPTION" (actual: 34, maximum: 30).


How should I remove the multiple ascii characters from the input data? I am able to remove one ascii character char(160) using the Ereplace function (which I posted in the forum).

Please let me know what is the mistake I am doing. Thanks.

Posted: Wed Sep 06, 2017 12:29 pm
by chulett
Perhaps you should look into why they are like that and see what needs to be done to correctly 'translate' them rather than simply removing them... just a thought.

Posted: Wed Sep 06, 2017 12:50 pm
by Vrisha
Hi Craig,

Source application team can't provide quality data. They told to remove the ascii characters using datastage and load into target.

In normal source oracle database, we can't see ascii characters. So using the export wizard, I uploaded into csv file and through eyecheck I found char values of 160,191,194 in the input.

I don't know how to write a logic to remove / translate them apart from what I tried above.

Posted: Wed Sep 06, 2017 1:50 pm
by UCDI
I found this to be extra painful in datastage and just did it in C.

make a lookup table like this

static const unsigned char pt[256] = {
' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ','!','"','#','$','%','&','\'','(',
')','*','+',',','-','.','/','0','1','2','3','4','5','6','7','8','9',':',';','<',
'=','>','?','@','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P',
'Q','R','S','T','U','V','W','X','Y','Z','[','\\',']','^','_','`','a','b','c','d',
'e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x',
'y','z','{','|','}','~',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' '
};

and then just loop over the string
newstring = pt[oldstring];


and then do an internal and external trim on space.
don't forget to fix the end of string : newstring[max] = 0;

Posted: Thu Sep 07, 2017 6:30 am
by Vrisha
Thank you, UCDI.

Your solution is a bit confusing for me. I want to use ereplace or convert to remove multiple ascii characters.

I don't know what mistake I am doing in the syntax of Convert for handling multiple ascii characters

Posted: Thu Sep 07, 2017 9:11 am
by UCDI
ereplace is horrible for this. you have to string like 100 of them together, its slow and the syntax is absolutely insane. It looks like
ereplace(ereplace(ereplace(ereplace .... which loops over the same string over 100 times replacing 1 bad character at a time in each loop.

my solution is to use c or basic or whatever to do it all at once in 1 loop.

If you are 100% sure you only have 3 bad characters and will only ever in the future have those 3 characters, e-replace should work when chained like this.

But you said you only spot checked and your source is unreliable, so I advise getting ALL the nonstandard characters out at once.

My lookup table replace approach kills everything that is not a normal character, not just the 3 you have found so far.

it is simple code. You make a lookup table of the entire ascii table (0-255) and replace the string data value with the lookup table value which is space for all junk characters and no change for the usable values. Then you can clean up extra spaces in datastage or the C whichever you prefer.

I can give you the basic code instead if you prefer, which you can use directly in a basic_transformer instead of a regular transformer. Its a little slower, but talking seconds per millions of records, hardly worth a mention in the grand scheme.

Posted: Thu Sep 07, 2017 10:14 am
by Vrisha
The problem is resolved

I checked Ray's post (Line Breaks in the flat files) and find the clue. It worked.

Steps
1)Created 3 stage variables
svChar160 - Char(160,@TRUE)
svChar191 - Char(191,@TRUE)
svChar194 - Char(194,@TRUE)

2) Transformation rule in transformer
Convert(svChar160:svChar191:svChar194,' ',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION))))

Source data - Paratransit¿Web-Based¿Trip¿Boo

Target data - ParatransitWeb-BasedTripBoo

Thank you, Ray for your valuable post.

Posted: Thu Sep 07, 2017 1:06 pm
by Vrisha
Thank you UCDI for your valuable suggestion.

Remove multiple Ascii (>256) characters from input data

Posted: Fri May 25, 2018 11:00 am
by BSridhar
Char and convert are not supporting for the characters whose ASCII is greater than 256. What to do in this case?

Thanks

Posted: Sat May 26, 2018 4:01 am
by ray.wurlod
UniChar() for Unicode data. ASCII stops at 255.

Posted: Tue May 29, 2018 8:32 am
by BSridhar
Hi Ray,

I couldn't find UniChar in version 11.5. Could you please guide me to find? I am sorry.

Thank You

Posted: Tue May 29, 2018 5:09 pm
by ray.wurlod
That suggests that you don't have NLS enabled, in which case you can't process anything other than ASCII.

Re: Remove multiple Ascii (>256) characters from input da

Posted: Wed May 30, 2018 11:17 am
by UCDI
BSridhar wrote:Char and convert are not supporting for the characters whose ASCII is greater than 256. What to do in this case?
not sure about datastage but 'most' software if told a 'text' file is in 1-byte code (ascii or one-off encodings) it will read unicode letters as a group of multiple ascii symbols.

For english and generally western languages, most of the unicode becomes some sort of space / blank symbol and the ascii letter, so hello looks like h e l l o and so on. However any truly non-english or non-ascii codes will make a mess and any code that attempts to clean it up will damage the actual data.

so while it is 'sort of' safe to hack around on 'american english text' that happens to be stored in unicode as if it were ascii, you are playing with fire to do this. If you are using unicode, use unicode for everything. If you are using ascii, you can promote to unicode if needed or stay in ascii. Avoid mixing them as much as possible. And there are other encodings from older systems that you have to use sometimes too, mostly 1 byte but NOT ascii.

Ill do a little of the hacking but not in datastage -- I have some code that can read dataset headers which are apparently in unicode for the internal textual parts... :)