I have a requirement to parse multiple XML files, many of which are over 2GB in size. As many of you will know, due to the way in which DataStage handles XML parsing, there is a limit on file size, and when that is breached, you'll see the jobs abort with:
APT_CombinedOperatorController,0: Operator terminated abnormally: Terminating with exception:APT_BadAlloc: Heap allocation failed.
Now, the approach that I have been using is to split the incoming XML into 'chunks' and use a loop to parse the chunks. Using this incoming file:
Code: Select all
<?xml version="1.0" encoding="utf-8"?>
<OutermostTag xmlns:xxx="http://www.amadeupurl.com/blahblah">
<Header>
<SomeInfo>Test</SomeInfo>
</Header>
<DataItem>12345</DataItem>
<DataItem>13456</DataItem>
<DataItem>14567</DataItem>
<DataItem>15678</DataItem>
<DataItem>16789</DataItem>
<DataItem>17890</DataItem>
</OutermostTag>
Code: Select all
RowCount & OutputRow
The job in the loop then puts together input files which each have:
Code: Select all
1 x Header
n x DataItems
1 x Footer
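To make that chunk layout concrete, here is a minimal sketch in Python rather than DataStage. It is an illustration, not the job I am actually running. It streams the file with the standard library's `xml.etree.ElementTree.iterparse`, so it works on element boundaries and does not care where the line breaks are. The function name `chunk_xml` and its parameters are made up for this example. It assumes the element names are unqualified (as in my sample), that the header comes before the first DataItem, and that the "footer" is simply the closing outermost tag.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def chunk_xml(source, items_per_chunk, item_tag="DataItem"):
    """Yield XML strings: 1 x header, up to items_per_chunk items, 1 x footer.

    `source` is a file name or a binary file object (hypothetical helper,
    assumes unqualified element names and a header that precedes the items).
    """
    ns_decls = []       # (prefix, uri) pairs declared on the outermost tag
    header_parts = []   # serialised non-item children of the root, e.g. <Header>
    items = []          # serialised items waiting to be written out
    root = None
    depth = 0

    def build():
        # Rebuild the outermost opening tag, then header, items and closing tag.
        attrs = "".join(
            " xmlns%s=%s" % (":" + prefix if prefix else "", quoteattr(uri))
            for prefix, uri in ns_decls
        )
        attrs += "".join(
            " %s=%s" % (name, quoteattr(value)) for name, value in root.attrib.items()
        )
        body = "".join(header_parts) + "".join(items)
        return "<%s%s>%s</%s>" % (root.tag, attrs, body, root.tag)

    events = ("start-ns", "start", "end")
    for event, elem in ET.iterparse(source, events=events):
        if event == "start-ns":
            if root is None:            # only keep declarations from the root tag
                ns_decls.append(elem)
        elif event == "start":
            depth += 1
            if depth == 1:
                root = elem
        else:  # "end"
            if depth == 2:              # a direct child of the outermost tag
                elem.tail = None        # drop whitespace that followed the element
                text = ET.tostring(elem, encoding="unicode")
                if elem.tag == item_tag:
                    items.append(text)
                else:
                    header_parts.append(text)
                root.remove(elem)       # release the element so memory stays flat
                if len(items) == items_per_chunk:
                    yield build()
                    items.clear()
            depth -= 1

    if items:                           # final, possibly short, chunk
        yield build()
```

Each string yielded could then be written to its own file, for example by looping over `enumerate(chunk_xml("big.xml", 100000))` (the file name and chunk size are placeholders), and those files fed to the parsing job in the loop.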
Code: Select all
<?xml version="1.0" encoding="utf-8"?>
<OutermostTag xmlns:xxx="http://www.amadeupurl.com/blahblah">
<Header>
<SomeInfo>Test</SomeInfo>
</Header><DataItem>12345</DataItem>
<DataItem>13456</DataItem>
<DataItem>14567</DataItem>
<DataItem>15678</DataItem>
<DataItem>16789</DataItem>
<DataItem>17890</DataItem>
</OutermostTag>
I'm told that the XML might even look like this at some point:
Code: Select all
<?xml version="1.0" encoding="utf-8"?>
<OutermostTag xmlns:xxx="http://www.amadeupurl.com/blahblah">
<Header>
<SomeInfo>Test</SomeInfo>
</Header><DataItem>12345</DataItem><DataItem>13456</DataItem><DataItem>14567</DataItem>
<DataItem>15678</DataItem><DataItem>16789</DataItem><DataItem>17890</DataItem></OutermostTag>
Can anyone suggest a neater way of chunking up XML, especially XML that might not contain whitespace etc. where you might expect it to? There may be an easier solution to this, but at the moment I am struggling to see the wood for the trees.
I presume that there must be other users parsing huge XML files - what techniques are you using?
Many thanks,
S