Parse data

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
sam334
Premium Member
Premium Member
Posts: 145
Joined: Mon Aug 26, 2013 7:42 pm

Parse data

Post by sam334 »

Hi All,
Need a help on parsing data
I have a sql database column with 3000 bytes.

Sample data looks like for that single row,

23:05:05: Aaaaaaaaaaaa23:05:05: Bbbbbbbbbbbb23:05:05: BBB: 983001587423:05:05: AAHH: 111123:05:05: XXXXXXXXX23:05:06: YYYYYYYY23:05:06: ZZZZZZZZZZZZ23:05:07: CCCCCCCCCCC23:05:07: SSSSSSSSSSSS23:05:07: QQQQQQQQQQQQQQ23:05:07: XXXXXXXXX: SUMITCHAW23:05:07: AAAAAAAAA23:05:07: TTTTTTTTTTTTTTT23:05:08: AAAAAAAA23:05:08: PPPPPPPPPPPPPPP23:05:08: EEEEEEEEEEE23:05:08: QQQQQQQQQQQQQQQQQQQ23:05:08: QQQQQQQQQ: SumitChaw23:05:08: NNNNNNNNNNNNNNN23:05:08: LLLLLLLLLLLLLL

Now, I want parse the data after the time interval, like,
23:05:05: Aaaaaaaaaaaa
23:05:05: Bbbbbbbbbbbb
23:05:05: BBB: 9830015874
23:05:05: AAHH: 1111

Any thoughts on this.Particularly what to use in transformer. I built another job before which has pipe delimiter, but this is giving trouble.
Appreciate your help.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It's probably easiest - assuming an arbitrary number of time/string pairs and arbitrary string lengths - to pre-parse this string into a text file in a before-stage subroutine (making use of MatchField() function), then have your job read from that file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ssnegi
Participant
Posts: 138
Joined: Thu Nov 15, 2007 4:17 am
Location: Sydney, Australia

Post by ssnegi »

Stage Variable :
dlimit : left(DSLink.COL,2)
dcnt : DCount(DSLink.COL,dlimit) - 1

Loop Condition
Loop While : loopcnt < dcnt
Loop Variable
loopcnt : Initial Value 0 Derivation : loopcnt+1
LoopVarValue:field(DSLink.COL,dlimit,loopcnt+1)[9,len(field(DSLink.COL,dlimit,loopcnt+1))]
LoopVarTime : dlimit:':':left(field(DSLink.COL,dlimit,loopcnt+1),7)

Columns :
COL1 : LoopVarTime
COL2 : Trim(LoopVarValue)

This will work if the time hour per line is the same ie the line has 23 as the hour for all the columns in the row.
If you can put a pipe delimter then it would be much easier. replace 23 by | in the dcnt variable.
you can modify this to use 23:05 as delimter if that is constant in the line so that the possibility of it occuring in the value string diminish.
Last edited by ssnegi on Tue Apr 29, 2014 10:43 pm, edited 1 time in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There is no looping capability in the Transformer stage in server jobs.

Further, you can't assume that times will always begin with 23:05. You must learn to create general solutions.

For example, the number of time/string pairs in the string is one quarter of the number of colon-delimited fields assuming no colons in the string values (think about it).

So the time/string pairs could be extracted by counting the number of colons, and creating a loop that starts from 1, has that value as its limit, and an increment of 4.

The Field() function could be used, with a third argument of the loop counter and a fourth argument of 4, to extract a time/string pair. Further Field() functions can be used to separate the time and string for transfer to the output.

Code: Select all

SUBROUTINE ParseTimeString(InputValue, ErrorCode)
Equate TAB To Char(9)
DEFFUN OpenSequentialFile(FileName,OpenMode,AppendMode,LogMode) Calling "DSU.OpenSequentialFile"

ErrorCode = 0

hInFile = OpenSequentialFile(InputValue, "R", "A", "Y")
hOutFile = OpenSequentialFile(InputValue:".out", "W", "O", "Y")
WriteCount = 0

Loop
While ReadNext Line From hInFile
    PairCount = Count(Line, ":")
    For Counter = 1 To PairCount Step 4
        TimeString = Field(Line, ":", Counter, 4)
        TimePortion = Field(TimeString, ":", 1, 3)
        StringPortion = Field(TimeString, ":", 4, 1)
        OutLine = TimePortion : TAB : StringPortion
        WriteSeq OutLine To hOutFile Then WriteCount += 1
    Next Counter
Repeat

CloseSeq hOutFile
CloseSeq hInFile

RETURN
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sam334
Premium Member
Premium Member
Posts: 145
Joined: Mon Aug 26, 2013 7:42 pm

Post by sam334 »

Yes, Ray is correct, it is a call log. So, will have calls from 6AM to 11.59 PM.
sam334
Premium Member
Premium Member
Posts: 145
Joined: Mon Aug 26, 2013 7:42 pm

Post by sam334 »

Ray, there in one previous job, we have pipe delimiters and two stage variables.

The data looked like,

9999,0,0,20150925125011,201409251259940,,,,,
|9915,0,0,201509254857441,20140925151088,,,,, and so on.

To parse it we used--
StgVar= Change(DSLink1.Column4, "|", "|": DSLink1.Column2:",")
StgVar1=DSLink1.Column2:",": Change(stgVar,"|",CHAR(013):CHAR(010))

We are adding the phone number as column 2 with each parsed row. Can we do anything like that instead of Subroutine for the aforesaid data too. Have to add the ID column infront of each row.

Thanks for your time and help.


3:05:05: Aaaaaaaaaaaa23:05:05: Bbbbbbbbbbbb23:05:05: BBB: 983001587423:05:05: AAHH: 111123:05:05: XXXXXXXXX23:05:06: YYYYYYYY23:05:06: ZZZZZZZZZZZZ23:05:07: CCCCCCCCCCC23:05:07: SSSSSSSSSSSS23:05:07: QQQQQQQQQQQQQQ23:05:07: XXXXXXXXX: SUMITCHAW23:05:07: AAAAAAAAA23:05:07: TTTTTTTTTTTTTTT23:05:08: AAAAAAAA23:05:08: PPPPPPPPPPPPPPP23:05:08: EEEEEEEEEEE23:05:08: QQQQQQQQQQQQQQQQQQQ23:05:08: QQQQQQQQQ: SumitChaw23:05:08: NNNNNNNNNNNNNNN23:05:08: LLLLLLLLLLLLLL

Now, I want parse the data after the time interval, like,
23:05:05: Aaaaaaaaaaaa
23:05:05: Bbbbbbbbbbbb
23:05:05: BBB: 9830015874
23:05:05: AAHH: 1111
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

My earlier code assumed (incorrecly) that there was a colon between the string and the next time. The code can be changed to account for that - I leave it as an exercise for the reader, with the hint that you discard that part of the string already processed each iteration through the loop.

As for your new design, have you tried it, and does it produce the expected result? It's not clear to me what's in Column4 and what's in Column2.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sam334
Premium Member
Premium Member
Posts: 145
Joined: Mon Aug 26, 2013 7:42 pm

Post by sam334 »

Ray,
Thanks for your reply. I tried the code, it was not giving the expected result. I will let you know what was the error.

Column 2 is the call id (which we added in beginning of each parsed row) and column 4 is the column which has the data to be parsed.

Thanks again.
sam334
Premium Member
Premium Member
Posts: 145
Joined: Mon Aug 26, 2013 7:42 pm

Post by sam334 »

Have a question on the code. It was compiling fine but in the code it says sequential file right. But we have the column in database which will be parsed in multiple rows.

Initial stages look like,

ODBC--TRANSFORMER--SEQUENTIAL FILE.

So, if the code is used in transformer it is not about sequential file right. Do we need to change here anything else? any arguments instead of sequential file?

Thanks,

PS-
For the below data.we need to add the call id in front of each parsed row.

3:05:05: Aaaaaaaaaaaa23:05:05: Bbbbbbbbbbbb23:05:05: BBB: 983001587423:05:05: AAHH: 111123:05:05: XXXXXXXXX23:05:06: YYYYYYYY23:05:06: ZZZZZZZZZZZZ23:05:07: CCCCCCCCCCC23:05:07: SSSSSSSSSSSS23:05:07: QQQQQQQQQQQQQQ23:05:07: XXXXXXXXX: SUMITCHAW23:05:07: AAAAAAAAA23:05:07: TTTTTTTTTTTTTTT23:05:08: AAAAAAAA23:05:08: PPPPPPPPPPPPPPP23:05:08: EEEEEEEEEEE23:05:08: QQQQQQQQQQQQQQQQQQQ23:05:08: QQQQQQQQQ: SumitChaw23:05:08: NNNNNNNNNNNNNNN23:05:08: LLLLLLLLLLLLLL

Now, I want parse the data after the time interval, like, (assuming 9162221212 is the call id)
9162221212,23:05:05: Aaaaaaaaaaaa
9162221212,23:05:05: Bbbbbbbbbbbb
9162221212,23:05:05: BBB: 9830015874
9162221212,23:05:05: AAHH: 1111
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Sequential File stage is a passive stage, and does not do any work (apart from formatting columns) on the data. Transformer stage is an active stage and is where the work (for example adding the call ID) must be done.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ray, I think the confusion is around the "SUBROUTINE ParseTimeString" that you posted. It writes its output to a sequential file and Sam is looking to leverage it in a Transformer derivation, thinking it should pass the parsed data back to the caller.

At least that's what I think I'm reading. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Fair enough. Then the job could write untransformed data into the flat file that the subroutine processes, and the routine could be invokved as a before-stage subroutine in the second Transformer stage.

Code: Select all

ODBC ---->  Transformer  ---->  SeqFile  ----> Transformer  ---->  SeqFile
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Code: Select all

SUBROUTINE ParseTimeString(InputValue, ErrorCode) 
Equate TAB To Char(9) 
DEFFUN OpenSequentialFile(FileName,OpenMode,AppendMode,LogMode) Calling "DSU.OpenSequentialFile"
 
ErrorCode = 0 

hInFile = OpenSequentialFile(InputValue, "R", "A", "Y") 
hOutFile = OpenSequentialFile(InputValue:".out", "W", "O", "Y") 
WriteCount = 0 

Loop 
While ReadNext Line From hInFile 
    Loop
        Pair = Field(Line, ":", 4, 1)
        Pair = Left(Pair, Len(Pair) - 3)
        Time = Field(Pair, ":", 1, 3)
        Rest = Field(Pair, ":", 4, 1)
        WriteSeq Time : ": " :  TAB : Rest To hOutFile Then WriteCount += 1
        Line = Right(Line, Len(Line) - Len(Pair))
    While Len(Line) > 0
    Repeat
Repeat 

CloseSeq hOutFile 
CloseSeq hInFile 

RETURN
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ray.wurlod wrote:Fair enough.
Thanks... didn't want to put words in your mouth and knew you could explain how you saw this playing out for those of us a little behind the curve. That and I didn't have any kind of time for a proper reply. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply