DSXchange: DataStage and IBM Websphere Data Integration Forum
View next topic
View previous topic
Add To Favorites
This topic has been marked "Resolved."
Author Message
jackson.eyton
Participant



Joined: 26 Oct 2017
Posts: 39

Points: 823

Post Posted: Thu Oct 26, 2017 11:03 am Reply with quote    Back to top    

DataStage® Release: 11x
Job Type: Parallel
OS: Windows
Hi everyone, new here. I am attempting to bring in dates to our datawarehouse, the issue that I have is that these dates are free form decimal of varying formats. Some may be %yyyy%ddd, others %mm%dd%yyyy, and still others are %mm%dd%yy. The biggest issue is that I can have multiple "formats" in the same column, (don't get me started on why it is that way, as of now I cannot fix that sadly). Through my iterations of working on this I thought I had found a solution as follows:
Step 1
Using a transform stage, convert all the date decimals to strings.

Step2
Using another transform stage, Set the following variables in the stage:
YearsInPast100 == DateOffsetByComponents(CurrentDate(), -100, 0, 0)
YearsInFuture100 == DateOffsetByComponents(CurrentDate(), 100, 0, 0)
Utilize IF THEN ELSE to test the conversions and apply the appropriate conversion:
Code:
IF $1 <> 0
THEN (
   IF (IsValidDate(StringToDate(Right("0":$1,8),"%mm%dd%yyyy")) = 1 AND (StringToDate(Right("0":$1,8),"%mm%dd%yyyy") > YearsInPast100 AND StringToDate(Right("0":$1,8),"%mm%dd%yyyy") < YearsInFuture100))
   THEN StringToDate(Right("0":$1,8),"%mm%dd%yyyy")
   ELSE (
      IF (IsValidDate(StringToDate(Right($1,7),"%yyyy%ddd")) = 1 AND (StringToDate(Right($1,7),"%yyyy%ddd") > YearsInPast100 AND StringToDate(Right($1,7),"%yyyy%ddd") < YearsInFuture100))
      THEN StringToDate(Right($1,7),"%yyyy%ddd")
      ELSE (
         IF (IsValidDate(StringToDate(Right("0":$1,6),"%mm%dd%1950yy")) = 1 AND (StringToDate(Right("0":$1,6),"%mm%dd%1950yy") > YearsInPast100 AND StringToDate(Right("0":$1,6),"%mm%dd%1950yy") < YearsInFuture100))
         THEN StringToDate(Right("0":$1,6),"%mm%dd%1950yy")
         ELSE SetNull()
         )
      )
   )
ELSE SetNull()


Any help on this would be greatly appreciated, I've been banging my head on the wall on this for a while now. The issue with the above code is that the IF condition never appears to be met.

'82510' gets transformed to NULL for example.

_________________
-Me
chulett

Premium Poster


since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42285
Location: Denver, CO
Points: 217115

Post Posted: Thu Oct 26, 2017 11:31 am Reply with quote    Back to top    

Okay, I haven't spent any time really evaluating what you are doing but the thing that jumps out at me is - don't use IsValidDate as it doesn't do what you think it is doing. Rather switch them to IsValid() with "date" as the type, you should find it works way more better.

You also technically don't need the "= 1" in the test as it will automatically resolve itself to true or false.

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow
Rate this response:  
qt_ky



Group memberships:
Premium Members

Joined: 03 Aug 2011
Posts: 2668
Location: USA
Points: 19836

Post Posted: Thu Oct 26, 2017 11:47 am Reply with quote    Back to top    

I think the derivation logic looks OK.

I have not seen the $1 syntax in a derivation though. Is that actually what you have in there? For each instance of $1, I would expect instead to see an input column name formatted like link_name.column_name or a stage variable name.

The other question is your input column actually a decimal data type or is it a string as mentioned in the example? I ran a quick test using your derivation and gave it a Varchar string as input and it resolved to an actual date.

Perhaps the problem is upstream in where you're converting decimals to strings. Put in a Peek stage after step 1 to view the strings. Make sure there's no leading spaces, etc.

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
Rate this response:  
Not yet rated
jackson.eyton
Participant



Joined: 26 Oct 2017
Posts: 39

Points: 823

Post Posted: Thu Oct 26, 2017 12:59 pm Reply with quote    Back to top    

chulett,
I will give that a shot and see how that performs. Could you elaborate on what IsValidDate is doing that I am misunderstanding? based on the following KB, it seems that the only difference between IsValid() with date type specified and IsValidDate() is that IsValid does not log warnings, which means I could remove the message handling to suppress those warnings that I have in place for this job, which is a huge plus. Am I missing anything else?
KB: https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.ds.parjob.dev.doc/topics/r_deeref_Type_Conversion_Functions.html

qt_ky,
I did get this to work as well! Very Happy I had to look closer at my data sources and expand my test scenarios. I am trimming any leading spaces in Step 1 as part of the DecimalToString conversion. The $1 is just there to indicate a link.column so that I can use the code as is with the Derivation Substitution option in the stage.

So while it does work now, I am seeing a pretty large issue that its choking on this process at 27 rows/sec now. In my job, I've had to use that nested IF THEN ELSE block for about 65 columns or so with about 40k rows from this data source. We only have a two node setup but normally 40k rows takes no time at all. I've killed it! Embarassed

UPDATE: 23 minutes to process 39900 rows, the only real transformation are those 65 "date" columns and a very small handful of Change(Trim(DecimalToString()," ",B),".","") conversions.[/url]

_________________
-Me
Rate this response:  
Not yet rated
chulett

Premium Poster


since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42285
Location: Denver, CO
Points: 217115

Post Posted: Thu Oct 26, 2017 2:02 pm Reply with quote    Back to top    

People have run afoul of the difference between the two functions in the past, which is why I pointed it out. I don't recall the gory details of the "why" of it, however. Part of it might be the fact that you end up converting it twice, once to check to see if it can be converted and then as a separate step converting it for real. AFAIK, it was meant to be used after conversion, say after a StringToDate() call to a "date" stage variable and then using IsValidDate() on the stage variable to see if the conversion was successful. More typical would be to use IsValid to see if the string can be converted to a date and if not, send it whatever default / empty value it needs to be set to.

Funny, was coming here to post the exact same documentation URL.

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow
Rate this response:  
Not yet rated
jackson.eyton
Participant



Joined: 26 Oct 2017
Posts: 39

Points: 823

Post Posted: Thu Oct 26, 2017 2:28 pm Reply with quote    Back to top    

Thanks chulett, this is interesting, so I have my new derivation as follows using IsValid():
VERSION 4
Code:
IF $1 <> 0
THEN (
   IF (IsValid("date",StringToDate(Right("0":$1,8),"%mm%dd%yyyy")) AND (StringToDate(Right("0":$1,8),"%mm%dd%yyyy") > YearsInPast100 AND StringToDate(Right("0":$1,8),"%mm%dd%yyyy") < YearsInFuture100))
   THEN StringToDate(Right("0":$1,8),"%mm%dd%yyyy")
   ELSE (
      IF (IsValid("Date",StringToDate(Right($1,7),"%yyyy%ddd")) AND (StringToDate(Right($1,7),"%yyyy%ddd") > YearsInPast100 AND StringToDate(Right($1,7),"%yyyy%ddd") < YearsInFuture100))
      THEN StringToDate(Right($1,7),"%yyyy%ddd")
      ELSE (
         IF (IsValid("Date",StringToDate(Right("0":$1,6),"%mm%dd%1950yy")) AND (StringToDate(Right("0":$1,6),"%mm%dd%1950yy") > YearsInPast100 AND StringToDate(Right("0":$1,6),"%mm%dd%1950yy") < YearsInFuture100))
         THEN StringToDate(Right("0":$1,6),"%mm%dd%1950yy")
         ELSE SetNull()
         )
      )
   )
ELSE SetNull()


As you can see I am still embedding the actual conversion function within the IsValid() function. If I am understanding you correctly I don't need to do that with IsValid()? If not do I still need to specify what format the decimal date is in? (E.G. %mm%dd%yyyy vs. %yyyy%ddd)

UPDATE:
So I did go ahead and test the following on just one column and it did still seem to convert the date appropriately, and this was a %mm%dd%1950yy format so it would have been the last one to be tested in the IF THEN ELSE. Seems to work?? I may give it a full run shot tomorrow. Here is the testing IF THEN ELSE:
TESTING
Code:
IF $1 <> 0
THEN (
   IF (IsValid("date",Right("0":$1,8),"%mm%dd%yyyy") AND (StringToDate(Right("0":$1,8),"%mm%dd%yyyy") > YearsInPast100 AND StringToDate(Right("0":$1,8),"%mm%dd%yyyy") < YearsInFuture100))
   THEN StringToDate(Right("0":$1,8),"%mm%dd%yyyy")
   ELSE (
      IF (IsValid("Date",Right($1,7),"%yyyy%ddd") AND (StringToDate(Right($1,7),"%yyyy%ddd") > YearsInPast100 AND StringToDate(Right($1,7),"%yyyy%ddd") < YearsInFuture100))
      THEN StringToDate(Right($1,7),"%yyyy%ddd")
      ELSE (
         IF (IsValid("Date",Right("0":$1,6),"%mm%dd%1950yy") AND (StringToDate(Right("0":$1,6),"%mm%dd%1950yy") > YearsInPast100 AND StringToDate(Right("0":$1,6),"%mm%dd%1950yy") < YearsInFuture100))
         THEN StringToDate(Right("0":$1,6),"%mm%dd%1950yy")
         ELSE SetNull()
         )
      )
   )
ELSE SetNull()


Thank you guys so much for the feedback!! I am quite impressed with the community here, this line of work is slightly new to me if you couldn't guess.

_________________
-Me
Rate this response:  
Not yet rated
chulett

Premium Poster


since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42285
Location: Denver, CO
Points: 217115

Post Posted: Thu Oct 26, 2017 4:31 pm Reply with quote    Back to top    

Right... do not do the actual conversion in the IsValid() function call, just pass it the string and the format mask to use for the test. Then if it is valid, go ahead and actually convert it.

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow
Rate this response:  
Not yet rated
qt_ky



Group memberships:
Premium Members

Joined: 03 Aug 2011
Posts: 2668
Location: USA
Points: 19836

Post Posted: Fri Oct 27, 2017 6:52 am Reply with quote    Back to top    

Here is a possible optimization that may use fewer CPU cycles.

In place of setting YearsInPast100 and YearsInFuture100, set a stage variable once: CurrentYear = YearFromDate(CurrentDate())

In the derivations do one absolute value test, instead of having two StringToDate(Right("0":$1,#),"%...")) evaluations:

Abs(YearFromDate(StringToDate(...)) - CurrentYear) < 100

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
Rate this response:  
chulett

Premium Poster


since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42285
Location: Denver, CO
Points: 217115

Post Posted: Fri Oct 27, 2017 8:30 am Reply with quote    Back to top    

Excellent suggestion... basically create a static variable by putting that derivation in as the stage variable's Initial Value and then (from what I recall) simply leave its derivation in the Transformer empty so it doesn't change. Worst case if it wants something there, set it to itself. Anything that doesn't change row-by-row can be handled like that.

And forgot to mention something you said in your initial post. There shouldn't be a need to do this in two separate Transformers, one should suffice.

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow
Rate this response:  
Not yet rated
jackson.eyton
Participant



Joined: 26 Oct 2017
Posts: 39

Points: 823

Post Posted: Fri Oct 27, 2017 9:00 am Reply with quote    Back to top    

qt_key,
Thank you for that suggestion! I did set a StageVar for the YearsInFuture/Past values but I did make the mistake of putting in the actual derivation to get that value both in the default value and the Derivation so yes that would make sense that it would be recalculating that every time. The Absolute is indeed a much cleaner approach, thank you very very much!

chulette,
I will definitely consolidate into a single transform, for the sake of testing and trying to retain some sanity I had split that out so that I did not have to embed Trim() and DecimalToString() within the IF THEN ELSE derivation. Hopefully consolidating those however will speed things up.

I will update again here once I have made these changes and done another test run.

UPDATE:
ok, I have completed my edits of the job so its basic structure is as follows:
DBIN -> Transform -> DBOUT

This is for what we are calling an intermediate staging job, basically taking RAW data from a stage table and transforming all the columns as needed for further use and outputting that to and intermediate stage table. Mostly this just means switching from char to varchar and extending lengths, there are 5-10 basic decimaltostring() conversions, then 60ish date conversions, which inherently require decimaltostring() among other verification steps as we have previously discussed. Our InfoSphere datastage server only has 2 CPUs at one core each, so we've configured with 2 nodes. Most of the transformations are those dates and this is using the following derivation now:
VERSION 6
Code:
IF $1 <> 0
THEN (
   IF (IsValid("date",Right("0":Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),8),"%mm%dd%yyyy") AND (Abs(YearFromDate(StringToDate(Right("0":Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),8),"%mm%dd%yyyy")) - CurrentYear) < 100))
   THEN StringToDate(Right("0":Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),8),"%mm%dd%yyyy")
   ELSE (
      IF (IsValid("date",Right("0":Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),7),"%yyyy%ddd") AND (Abs(YearFromDate(StringToDate(Right(Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),7),"%yyyy%ddd")) - CurrentYear) < 100))
      THEN StringToDate(Right(Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),7),"%yyyy%ddd")
      ELSE (
         IF (IsValid("date",Right("0":Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),6),"%mm%dd%1950yy") AND (Abs(yearFromDate(StringToDate(Right("0":Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),6),"%mm%dd%1950yy")) - CurrentYear) < 100))
         THEN StringToDate(Right("0":Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),6),"%mm%dd%1950yy")
         ELSE SetNull()
         )
      )
   )
ELSE SetNull()


This incorporates the suggestions you guys have made and did indeed increase processing speed by 4x. So it runs at 85 rows/sec now, MUCH better than 20 rows/second I was getting before. However, at 8 minutes for the job, and this is only one of MANY that is going to stack up and I believe these will need to be daily jobs. Can you guys think of any other options I might have to speed things up?

_________________
-Me
Rate this response:  
Not yet rated
asorrell
Site Admin

Group memberships:
Premium Members, DSXchange Team, Inner Circle, Server to Parallel Transition Group

Joined: 04 Apr 2003
Posts: 1637
Location: Colleyville, Texas
Points: 22256

Post Posted: Sat Oct 28, 2017 10:07 pm Reply with quote    Back to top    

Suggestion 1 -
Do some analysis on the files and determine which of the three formats is the most common. Re-order your If statement to do that test first. Then do the least frequent format test last. That way you "fall out" of the test as quickly as possible the majority of the time. If your data is very skewed towards a single format, that would definitely help.

Suggestion 2 -

In each of the three "sections" you are doing a format three times - for example in the first section you are doing

Right("0":Change(Trim(DecimalToString($1,"suppress_zero")," ","B"),".",""),8)

three times in the first "If".

It might be faster do all three unique reformats "up front" and store each of them in three stage variables. Then you could substitute the stage variables in the If statement in all the various spots. That way you only do 3 reformats instead of 9.

Worst case - it passes the first test and you don't save anything (3 reformats old way, 3 reformats new way). Best case, it falls through to the final test and you saved six reformats (9 old way, 3 new way).

If your data is evenly spread among the three formats, then this should have an impact.

Suggestion 3 -

Kick the data supplier in the arse and tell them that supplying data in a non-standardized format is unacceptable!

May not speed up the job but at least you'll feel better!

_________________
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2017
Rate this response:  
qt_ky



Group memberships:
Premium Members

Joined: 03 Aug 2011
Posts: 2668
Location: USA
Points: 19836

Post Posted: Mon Oct 30, 2017 7:18 am Reply with quote    Back to top    

I like suggestion 3!! Smile

Would it help to pre-check each test based on the length of the incoming string column or stage variable? It might save a bunch of complex evaluations.

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
Rate this response:  
Not yet rated
jackson.eyton
Participant



Joined: 26 Oct 2017
Posts: 39

Points: 823

Post Posted: Mon Oct 30, 2017 9:59 am Reply with quote    Back to top    

asorrell,
I can and will absolutely do suggestion 1, I didnt think of that as I had my mind stuck on them having to be in that order, but that was before I was doing date within 100 years past/future checking. This SHOULD help a lot I think.

I will definitely look into Suggestion 2, however I believe I would need to create 3 variables for each date column, which would be close 200 variables. I have yet to look into this at all but what about loop conditions? (no need to spoon feed me there I will absolutely do some research on that when I get a moment).

As for suggestion 3, you and qt_key are both absolutely correct, and it has been brought up, AND if we cannot get my process to function at an adequate speed then we will merely revert to converting for the date type that the metadata says it should be and let them know that some dates may show null due to deviations in their entries.

I will update this thread again soon with my progress and results. Thanks again, you guys are amazing. I used to be a server administrator and developer. I worked with a pretty specialized software that had some very odd quirks, but the community was SO good. I am extremely happy to have that same sense of community here. Smile

_________________
-Me
Rate this response:  
Not yet rated
jackson.eyton
Participant



Joined: 26 Oct 2017
Posts: 39

Points: 823

Post Posted: Tue Oct 31, 2017 2:31 pm Reply with quote    Back to top    

Great News!!! Reordering the IF THEN ELSE logic to hit the columns defined date format first has resolved the performance issue. Its not as good as some jobs but its back up to 800+ Rows/sec and I will take that. Thank you guys a TON for the amazingly helpful input!

_________________
-Me
Rate this response:  
Not yet rated
Display posts from previous:       

Add To Favorites
View next topic
View previous topic
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum



Powered by phpBB © 2001, 2002 phpBB Group
Theme & Graphics by Daz :: Portal by Smartor
All times are GMT - 6 Hours