'Chunk' XML Files before parsing

sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

'Chunk' XML Files before parsing

Post by sjordery »

Hello All,

I have a requirement to parse multiple XML files, many of which are over 2GB in size. As many of you will know, due to the way in which DataStage handles XML parsing there is a limit on file size, and when that is breached you'll see the jobs abort with:
APT_CombinedOperatorController,0: Operator terminated abnormally: Terminating with exception:APT_BadAlloc: Heap allocation failed.
Now, the approach that I have been using is to split the incoming XML into 'chunks' and use a loop to parse them. Given this incoming file:

Code: Select all

<?xml version="1.0" encoding="utf-8"?>
<OutermostTag xmlns:xxx="http://www.amadeupurl.com/blahblah">
<Header>
<SomeInfo>Test</SomeInfo>
</Header>
<DataItem>12345</DataItem>
<DataItem>13456</DataItem>
<DataItem>14567</DataItem>
<DataItem>15678</DataItem>
<DataItem>16789</DataItem>
<DataItem>17890</DataItem>
</OutermostTag>
The job takes this input, reads it sequentially, and writes two columns forward:

Code: Select all

RowCount & OutputRow
Index() is used to establish whether the incoming row is a data item or a header/footer item. When a data item is found, the row count is incremented and set for that row. The next transformer uses the row count: if it is 0 the row is assumed to be a header row and is written to the header file; if it is greater than 0 it is a data item, so it is written to the DataItem file. Another check against a specific marker on the footer row is used to write footer data out to a Footer file.
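To illustrate the logic outside DataStage, here's a rough standalone C++ sketch of the same split - purely illustrative, with placeholder file names and </OutermostTag> assumed as the footer marker:

Code: Select all

#include <fstream>
#include <string>

// Split a line-oriented XML extract into header, data-item and footer files,
// mimicking the RowCount/OutputRow logic described above.
int main() {
    std::ifstream in("bigfile.xml");                          // placeholder input name
    std::ofstream header("header.txt"), data("dataitems.txt"), footer("footer.txt");
    std::string line;
    long rowCount = 0;
    while (std::getline(in, line)) {
        if (line.find("<DataItem>") != std::string::npos) {
            ++rowCount;                                       // data item: bump the count
            data << line << '\n';
        } else if (line.find("</OutermostTag>") != std::string::npos) {
            footer << line << '\n';                           // assumed footer marker
        } else {
            header << line << '\n';                           // header rows (rowCount still 0)
        }
    }
    return 0;
}

Like the transformer version, this only holds together while each tag sits on its own line - which is exactly where things fall over, as described below.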

The job in the loop then assembles input files which contain:

Code: Select all

1 x Header
n x DataItems
1 x Footer
Now, this approach works fine when the incoming XML is as above. Problems have started because the incoming XML now looks like:

Code: Select all

<?xml version="1.0" encoding="utf-8"?>
<OutermostTag xmlns:xxx="http://www.amadeupurl.com/blahblah">
<Header>
<SomeInfo>Test</SomeInfo>
</Header><DataItem>12345</DataItem>
<DataItem>13456</DataItem>
<DataItem>14567</DataItem>
<DataItem>15678</DataItem>
<DataItem>16789</DataItem>
<DataItem>17890</DataItem>
</OutermostTag>
When this occurs, the index matches the </Header> row as a DataItem row, which means that only the first chunk is correct - subsequent chunks are missing the </Header> tag.

I'm told that the XML might even look like this at some point:

Code: Select all

<?xml version="1.0" encoding="utf-8"?>
<OutermostTag xmlns:xxx="http://www.amadeupurl.com/blahblah">
<Header>
<SomeInfo>Test</SomeInfo>
</Header><DataItem>12345</DataItem><DataItem>13456</DataItem><DataItem>14567</DataItem>
<DataItem>15678</DataItem><DataItem>16789</DataItem><DataItem>17890</DataItem></OutermostTag>
That would definitely mess up this technique.

Can anyone suggest a neater way of chunking up XML - especially XML that might not contain whitespace where you might expect it to? It may be that there is an easier solution to this, but at the moment I am struggling to see the wood for the trees.

I presume that there must be other users parsing huge XML files - what techniques are you using?

Many thanks,
S
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Interesting... I'm going to keep an eye on this and see what comes of it, and whether it attracts our resident XML gadfly Ernie. :wink:

We very rarely have this issue, tending instead to get a large number of smaller files rather than a small number of larger files. And when we've seen the gigantor file, I've parsed it myself. Manually. [sigh]

I personally think generating files of that size is... well, just wrong and doesn't do anyone on the receiving end any favors. :evil:
-craig

"You can never have too many knives" -- Logan Nine Fingers
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

chulett wrote:I personally think generating files of that size is... well, just wrong and doesn't do anyone on the receiving end any favors. :evil:
I have to admit that those words have echoed around my head more than once. :roll:

Cheers,
S
VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

I agree with the feedback on the drawbacks of gigantic files...

When we had a similar situation, I ended up coding a Perl script (manual parsing) to convert this mega XML into a manageable form... again, in line with Craig's comments.

Good luck,
-V
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

A rather crude way of getting the file to look the way you want to be able to chunk it (it looks hilarious since I am particularly poor at sed and awk):

Code: Select all

sed 's_<[A-Z]_*&_g' testxml.txt | sed 's_</[^D]_*&_g' | tr '*' '\n'
testxml.txt is your xml file.

This will basically render your file in the form you want to be able to chunk it, with some extra lines - which I am sure any experienced sed and awk user will be able to get rid of. This can be one way of standardising the xml file into the nice format you want.

Although I have not used huge XML files, I have had cases where we needed to do multiple progressive parses, and I used a buildop there: read the file in the buildop and spit out only the values as and when you get them. The problem with any such technique is that they should all be classified as "manual", since they do not take the XSD as an input and hence are not generic enough to handle any XML spec. That said, if your XML structure (tag definitions) is fixed, this can be effective and can give huge performance benefits.
It took me fifteen years to discover I had no talent for ETL, but I couldn't give it up because by that time I was too famous.
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

Hey sud - thanks for this, it's an approach that I hadn't really considered, so will see if I can fit it in.

I am interested in the buildop technique, and (due to the type of XSDs we use not working well with the DS metadata importer - long story, different thread) I am using manually coded XPaths, so that might be a good option. Would you be able to provide more details please (or point me in the right direction to find them), as I've not used buildops before?

Thanks
S
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

sud wrote:A rather crude way of getting the file to look the way you want to be able to chunk it (it looks hilarious since I am particularly poor at sed and awk):

Code: Select all

sed 's_<[A-Z]_*&_g' testxml.txt | sed 's_</[^D]_*&_g' | tr '*' '\n'
testxml.txt is your xml file.

This will basically render your file in the form you want to be able to chunk it, with some extra lines - which I am sure any experienced sed and awk user will be able to get rid of. This can be one way of standardising the xml file into the nice format you want.
Where would you place this in your job? Am I right to say that this doesn't actually modify the file per se - i.e. if I ran this command as an ExecSH before the job, the job would still use the file in its original format?

Thanks
S
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

Well, starting with a buildop: a quick look at the example in the advanced guide might get you started, and otherwise we (and I) will definitely help you out. But even before that, the most important thing to consider is the varying XSDs - you should not end up with different buildops for different tag names/definitions (schemas, so to say), and yet it is nearly unrealistic to try to create a whole XML parser using a buildop.

As a realistic example, take your case where you always fetch the data between the <DataItem> tags: if all your cases involve similar parsing where just the tag name changes, then the tag name "DataItem" could be given as an input to the buildop, and it would output as many records as there are occurrences of matching tags, getting the value between them.

In fact, if the XMLs are really that simple in format, use a combination of sed and awk; say, try this on one of the files that you have:

Code: Select all

sed 's_<[A-Z]_*&_g' testxml.txt | sed 's_</[^D]_*&_g' | tr '*' '\n' | sed 's_<DataItem>_<DataItem> _g' | sed 's_</DataItem>_\
</DataItem>_g' | awk '$1=="<DataItem>"{print $2}'
after changing the filename. In fact, this strategy can be used to get rid of a lot of unwanted stuff from your XML file and prune it as a pre-process.

Coming back to the buildop: it could be modelled as a source stage that takes a filename property and some tag definition info (keeping things very general here; to begin with it could just take "DataItem" as an input string). The initialization step would be to open the file in read mode (use fread64() for very big files), then use functions like strchr() to jump to the starting positions of the tag we are looking for (please don't get bogged down by syntax - you can refer to http://www.cplusplus.com/reference/clibrary for all the ANSI C functions, which will always work with DataStage), do a sequential read up to the position of the closing tag (using strspn()), and for every such occurrence write out one record. The loop activities in the buildop are not needed since we will not go through any input records.
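To make that concrete, here is a bare-bones standalone C++ sketch of the extraction step - not real buildop code: the file name is a placeholder, it buffers the whole file for brevity where a real buildop would read in blocks, and it uses string find rather than the C functions named above:

Code: Select all

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Sketch: read the file and print the text between every <DataItem> and
// </DataItem> pair, one value per line. In a real buildop each value would
// be written to the output link instead of std::cout.
int main() {
    const std::string openTag  = "<DataItem>";          // tag name would be a stage property
    const std::string closeTag = "</DataItem>";

    std::ifstream in("testxml.txt", std::ios::binary);  // placeholder file name
    std::stringstream ss;
    ss << in.rdbuf();                                   // whole file in memory, for brevity only
    const std::string xml = ss.str();

    std::string::size_type pos = 0;
    while ((pos = xml.find(openTag, pos)) != std::string::npos) {
        pos += openTag.size();                           // jump past the opening tag
        const std::string::size_type end = xml.find(closeTag, pos);
        if (end == std::string::npos) break;             // no matching closing tag left
        std::cout << xml.substr(pos, end - pos) << '\n'; // one record per occurrence
        pos = end + closeTag.size();
    }
    return 0;
}

The same loop works whether the tags sit on separate lines or are all mashed into one long line, which is what makes it attractive for your case.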

If this is the first time you are about to develop a buildop it might sound complex, but really it is not. If you read through the advanced guide's example and try out a few cases of input/output control through code, you will figure out how flexible things can be.
It took me fifteen years to discover I had no talent for ETL, but I couldn't give it up because by that time I was too famous.
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

Well, the sed/awk commands can be executed in the shell (you can use ExecSH from DataStage); add a "> filename" to store the output in a new file and point the job at that file instead.
It took me fifteen years to discover I had no talent for ETL, but I couldn't give it up because by that time I was too famous.
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

I had a similar issue with large XML files (the XML Input stage terminated abruptly) and ended up using sed/awk as noted; that could be your best workaround.
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

Thanks sud for your time and input. I'll give it a go.

Best,
S
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

This is indeed a tricky problem. Because the whole thing could be one giant chunk (no CR/LFs or anything), the best solutions I've heard discussed (but haven't actually implemented or seen anyone implement) are those where a SAX-based program (SAX being an "event" based XML API that walks through the document element by element), written in Java or another language that supports a SAX parser library, is used before DataStage to break up the document logically. Outside of coding something yourself, you might consider using MapStage and invoking TX to do this, as its XML reading capability is apparently SAX oriented; I appreciate that that's a cost-based solution and not every site has TX already installed [but many do, as it has a large install base - it's worth checking]. Engineering is looking at this, I'm told, but we'll have to be patient.
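For illustration only, a bare-bones event-based sketch using expat (one SAX-style C parser; the file and tag names are just placeholders taken from the earlier sample) would look something like this:

Code: Select all

#include <cstdio>
#include <cstring>
#include <string>
#include <expat.h>

// Stream the document through expat in fixed-size blocks and collect the
// character data of every <DataItem> element, so memory use stays flat no
// matter how large the file is.
static std::string current;
static bool inDataItem = false;

static void XMLCALL onStart(void *, const XML_Char *name, const XML_Char **) {
    if (std::strcmp(name, "DataItem") == 0) { inDataItem = true; current.clear(); }
}

static void XMLCALL onEnd(void *, const XML_Char *name) {
    if (std::strcmp(name, "DataItem") == 0) {
        std::printf("%s\n", current.c_str());            // emit one record per element
        inDataItem = false;
    }
}

static void XMLCALL onText(void *, const XML_Char *s, int len) {
    if (inDataItem) current.append(s, len);
}

int main() {
    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetElementHandler(parser, onStart, onEnd);
    XML_SetCharacterDataHandler(parser, onText);

    std::FILE *fp = std::fopen("testxml.txt", "rb");     // placeholder file name
    char buf[8192];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof(buf), fp)) > 0)
        XML_Parse(parser, buf, (int)n, 0);               // feed the parser block by block
    XML_Parse(parser, buf, 0, 1);                        // signal end of document
    std::fclose(fp);
    XML_ParserFree(parser);
    return 0;
}

The handlers decide what to emit (or where to split the output into smaller files), and nothing requires the whole document to be held in memory.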

In the meantime, I agree with many of the comments above regarding size. We see large files, but even as often as they come up, they are a tiny percentage when compared to the vast amount of XML being processed by all of you with DataStage and other tools, most of which is far smaller than 500MB but often comes in large numbers (like thousands of 200KB XML documents each night). Still, it needs to be addressed.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... here/)
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

Thanks for the input Ernie, much appreciated.

Regards,
S