Parse data

sam334 · Post by **sam334** » Tue Apr 29, 2014 2:42 pm

Hi All,
Need a help on parsing data
I have a sql database column with 3000 bytes.

Sample data looks like for that single row,

23:05:05: Aaaaaaaaaaaa23:05:05: Bbbbbbbbbbbb23:05:05: BBB: 983001587423:05:05: AAHH: 111123:05:05: XXXXXXXXX23:05:06: YYYYYYYY23:05:06: ZZZZZZZZZZZZ23:05:07: CCCCCCCCCCC23:05:07: SSSSSSSSSSSS23:05:07: QQQQQQQQQQQQQQ23:05:07: XXXXXXXXX: SUMITCHAW23:05:07: AAAAAAAAA23:05:07: TTTTTTTTTTTTTTT23:05:08: AAAAAAAA23:05:08: PPPPPPPPPPPPPPP23:05:08: EEEEEEEEEEE23:05:08: QQQQQQQQQQQQQQQQQQQ23:05:08: QQQQQQQQQ: SumitChaw23:05:08: NNNNNNNNNNNNNNN23:05:08: LLLLLLLLLLLLLL

Now, I want parse the data after the time interval, like,
23:05:05: Aaaaaaaaaaaa
23:05:05: Bbbbbbbbbbbb
23:05:05: BBB: 9830015874
23:05:05: AAHH: 1111

Any thoughts on this.Particularly what to use in transformer. I built another job before which has pipe delimiter, but this is giving trouble.
Appreciate your help.

ray.wurlod · Post by **ray.wurlod** » Tue Apr 29, 2014 4:16 pm

It's probably easiest - assuming an arbitrary number of time/string pairs and arbitrary string lengths - to pre-parse this string into a text file in a before-stage subroutine (making use of MatchField() function), then have your job read from that file.

ssnegi · Post by **ssnegi** » Tue Apr 29, 2014 8:28 pm

Stage Variable :
dlimit : left(DSLink.COL,2)
dcnt : DCount(DSLink.COL,dlimit) - 1

Loop Condition
Loop While : loopcnt < dcnt
Loop Variable
loopcnt : Initial Value 0 Derivation : loopcnt+1
LoopVarValue:field(DSLink.COL,dlimit,loopcnt+1)[9,len(field(DSLink.COL,dlimit,loopcnt+1))]
LoopVarTime : dlimit:':':left(field(DSLink.COL,dlimit,loopcnt+1),7)

Columns :
COL1 : LoopVarTime
COL2 : Trim(LoopVarValue)

This will work if the time hour per line is the same ie the line has 23 as the hour for all the columns in the row.
If you can put a pipe delimter then it would be much easier. replace 23 by | in the dcnt variable.
you can modify this to use 23:05 as delimter if that is constant in the line so that the possibility of it occuring in the value string diminish.

ray.wurlod · Post by **ray.wurlod** » Tue Apr 29, 2014 10:39 pm

There is no looping capability in the Transformer stage in server jobs.

Further, you can't assume that times will always begin with 23:05. You must learn to create general solutions.

For example, the number of time/string pairs in the string is one quarter of the number of colon-delimited fields assuming no colons in the string values (think about it).

So the time/string pairs could be extracted by counting the number of colons, and creating a loop that starts from 1, has that value as its limit, and an increment of 4.

The Field() function could be used, with a third argument of the loop counter and a fourth argument of 4, to extract a time/string pair. Further Field() functions can be used to separate the time and string for transfer to the output.

Code: Select all

SUBROUTINE ParseTimeString(InputValue, ErrorCode)
Equate TAB To Char(9)
DEFFUN OpenSequentialFile(FileName,OpenMode,AppendMode,LogMode) Calling "DSU.OpenSequentialFile"

ErrorCode = 0

hInFile = OpenSequentialFile(InputValue, "R", "A", "Y")
hOutFile = OpenSequentialFile(InputValue:".out", "W", "O", "Y")
WriteCount = 0

Loop
While ReadNext Line From hInFile
    PairCount = Count(Line, ":")
    For Counter = 1 To PairCount Step 4
        TimeString = Field(Line, ":", Counter, 4)
        TimePortion = Field(TimeString, ":", 1, 3)
        StringPortion = Field(TimeString, ":", 4, 1)
        OutLine = TimePortion : TAB : StringPortion
        WriteSeq OutLine To hOutFile Then WriteCount += 1
    Next Counter
Repeat

CloseSeq hOutFile
CloseSeq hInFile

RETURN

sam334 · Post by **sam334** » Wed Apr 30, 2014 8:58 am

Yes, Ray is correct, it is a call log. So, will have calls from 6AM to 11.59 PM.

sam334 · Post by **sam334** » Fri May 02, 2014 7:51 am

Ray, there in one previous job, we have pipe delimiters and two stage variables.

The data looked like,

9999,0,0,20150925125011,201409251259940,,,,,
|9915,0,0,201509254857441,20140925151088,,,,, and so on.

To parse it we used--
StgVar= Change(DSLink1.Column4, "|", "|": DSLink1.Column2:",")
StgVar1=DSLink1.Column2:",": Change(stgVar,"|",CHAR(013):CHAR(010))

We are adding the phone number as column 2 with each parsed row. Can we do anything like that instead of Subroutine for the aforesaid data too. Have to add the ID column infront of each row.

Thanks for your time and help.

3:05:05: Aaaaaaaaaaaa23:05:05: Bbbbbbbbbbbb23:05:05: BBB: 983001587423:05:05: AAHH: 111123:05:05: XXXXXXXXX23:05:06: YYYYYYYY23:05:06: ZZZZZZZZZZZZ23:05:07: CCCCCCCCCCC23:05:07: SSSSSSSSSSSS23:05:07: QQQQQQQQQQQQQQ23:05:07: XXXXXXXXX: SUMITCHAW23:05:07: AAAAAAAAA23:05:07: TTTTTTTTTTTTTTT23:05:08: AAAAAAAA23:05:08: PPPPPPPPPPPPPPP23:05:08: EEEEEEEEEEE23:05:08: QQQQQQQQQQQQQQQQQQQ23:05:08: QQQQQQQQQ: SumitChaw23:05:08: NNNNNNNNNNNNNNN23:05:08: LLLLLLLLLLLLLL

Now, I want parse the data after the time interval, like,
23:05:05: Aaaaaaaaaaaa
23:05:05: Bbbbbbbbbbbb
23:05:05: BBB: 9830015874
23:05:05: AAHH: 1111

ray.wurlod · Post by **ray.wurlod** » Fri May 02, 2014 6:26 pm

My earlier code assumed (incorrecly) that there was a colon between the string and the next time. The code can be changed to account for that - I leave it as an exercise for the reader, with the hint that you discard that part of the string already processed each iteration through the loop.

As for your new design, have you tried it, and does it produce the expected result? It's not clear to me what's in Column4 and what's in Column2.

sam334 · Post by **sam334** » Fri May 02, 2014 9:08 pm

Ray,
Thanks for your reply. I tried the code, it was not giving the expected result. I will let you know what was the error.

Column 2 is the call id (which we added in beginning of each parsed row) and column 4 is the column which has the data to be parsed.

Thanks again.

sam334 · Post by **sam334** » Thu May 08, 2014 9:03 am

Have a question on the code. It was compiling fine but in the code it says sequential file right. But we have the column in database which will be parsed in multiple rows.

Initial stages look like,

ODBC--TRANSFORMER--SEQUENTIAL FILE.

So, if the code is used in transformer it is not about sequential file right. Do we need to change here anything else? any arguments instead of sequential file?

Thanks,

PS-
For the below data.we need to add the call id in front of each parsed row.

3:05:05: Aaaaaaaaaaaa23:05:05: Bbbbbbbbbbbb23:05:05: BBB: 983001587423:05:05: AAHH: 111123:05:05: XXXXXXXXX23:05:06: YYYYYYYY23:05:06: ZZZZZZZZZZZZ23:05:07: CCCCCCCCCCC23:05:07: SSSSSSSSSSSS23:05:07: QQQQQQQQQQQQQQ23:05:07: XXXXXXXXX: SUMITCHAW23:05:07: AAAAAAAAA23:05:07: TTTTTTTTTTTTTTT23:05:08: AAAAAAAA23:05:08: PPPPPPPPPPPPPPP23:05:08: EEEEEEEEEEE23:05:08: QQQQQQQQQQQQQQQQQQQ23:05:08: QQQQQQQQQ: SumitChaw23:05:08: NNNNNNNNNNNNNNN23:05:08: LLLLLLLLLLLLLL

Now, I want parse the data after the time interval, like, (assuming 9162221212 is the call id)
9162221212,23:05:05: Aaaaaaaaaaaa
9162221212,23:05:05: Bbbbbbbbbbbb
9162221212,23:05:05: BBB: 9830015874
9162221212,23:05:05: AAHH: 1111

ray.wurlod · Post by **ray.wurlod** » Thu May 08, 2014 4:22 pm

Sequential File stage is a passive stage, and does not do any work (apart from formatting columns) on the data. Transformer stage is an active stage and is where the work (for example adding the call ID) must be done.

chulett · Post by **chulett** » Thu May 08, 2014 4:38 pm

Ray, I think the confusion is around the "SUBROUTINE ParseTimeString" that you posted. It writes its output to a sequential file and Sam is looking to leverage it in a Transformer derivation, thinking it should pass the parsed data back to the caller.

At least that's what I think I'm reading.

ray.wurlod · Post by **ray.wurlod** » Thu May 08, 2014 10:16 pm

Fair enough. Then the job could write untransformed data into the flat file that the subroutine processes, and the routine could be invokved as a before-stage subroutine in the second Transformer stage.

Code: Select all

ODBC ---->  Transformer  ---->  SeqFile  ----> Transformer  ---->  SeqFile

ray.wurlod · Post by **ray.wurlod** » Thu May 08, 2014 10:22 pm

Code: Select all

SUBROUTINE ParseTimeString(InputValue, ErrorCode) 
Equate TAB To Char(9) 
DEFFUN OpenSequentialFile(FileName,OpenMode,AppendMode,LogMode) Calling "DSU.OpenSequentialFile"
 
ErrorCode = 0 

hInFile = OpenSequentialFile(InputValue, "R", "A", "Y") 
hOutFile = OpenSequentialFile(InputValue:".out", "W", "O", "Y") 
WriteCount = 0 

Loop 
While ReadNext Line From hInFile 
    Loop
        Pair = Field(Line, ":", 4, 1)
        Pair = Left(Pair, Len(Pair) - 3)
        Time = Field(Pair, ":", 1, 3)
        Rest = Field(Pair, ":", 4, 1)
        WriteSeq Time : ": " :  TAB : Rest To hOutFile Then WriteCount += 1
        Line = Right(Line, Len(Line) - Len(Pair))
    While Len(Line) > 0
    Repeat
Repeat 

CloseSeq hOutFile 
CloseSeq hInFile 

RETURN

chulett · Post by **chulett** » Thu May 08, 2014 10:30 pm

ray.wurlod wrote:Fair enough.

Thanks... didn't want to put words in your mouth and knew you could explain how you saw this playing out for those of us a little behind the curve. That and I didn't have any kind of time for a proper reply.