Split - Output XML Compressed File

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You make one up if need be; from what I recall, it doesn't even have to be part of the output if you don't give it an XPath Expression. And do I understand correctly that you have two writers writing to the same sequential target simultaneously? If so, that's not going to work, I'm afraid.
-craig

"You can never have too many knives" -- Logan Nine Fingers
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

There are some interesting things in this thread.....

...first, as one would expect me to say, I disagree with any objections to XML "in general"...... :)

...but secondly, I agree wholeheartedly on that topic when the need is for "huge" xml [so...all good concerns and criticisms, Eric, Craig... ; ) ]. Definitely a waste of space and energy. JSON is better, but even that could be debatable when compared to other effective storage and transport mechanisms.

Regarding this issue, one major thing catches me......the thread starts with "building ONE document". If that's the case, it CANNOT be broken up after the fact with the expectation that EACH PIECE will be usable after unzipping. If you create a document of 172 Meg, and then zip it into 50/50/50/22 meg pieces...then they must be put back together into a single document in order to be usable. Anything short of that is not just a potential tag-damaging issue, but could be a structural issue.....unless the xml is ridiculously primitive and just a set of relational rows in a single repeating node (in which case, what's the point of using xml in the first place?).

You might "logically" break them up, which is the trigger idea....breaking on some real or artificial grouping of rows. If the data is predictable, coming from one source, and just "n" rows, you might be able to do this quite easily.....if the data is not predictable, and comes from 18 different sources with a complex multi-path hierarchy, it is much more difficult.
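For the predictable single-source case, the "logically break them up" idea amounts to tagging every n rows with a group key that downstream logic can break on. Below is a minimal shell sketch of that tagging step; `rows.txt`, `ROWS_PER_FILE`, and the pipe-delimited layout are illustrative assumptions, not part of the original job.

```shell
#!/bin/sh
# Tag every N rows of a flat extract with a group number, so each group
# can later be rendered as its own XML document. ROWS_PER_FILE is a
# stand-in you would tune toward the 50 Mb target.
ROWS_PER_FILE=2
printf 'row1\nrow2\nrow3\nrow4\nrow5\n' > rows.txt   # stand-in for the real extract
awk -v n="$ROWS_PER_FILE" '{ printf "%d|%s\n", int((NR-1)/n)+1, $0 }' rows.txt > grouped.txt
cat grouped.txt
```

With real data you would replace the fixed row count with whatever grouping the trigger logic produces.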

How is the xml being consumed at the other end?

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Again with the repeating of everything. :?

Ernie, we've already had the "you can't just split it" conversation since Joyce had previously confirmed they will be consumed individually. And I'm guessing that it really does fall into your "ridiculously primitive" camp of XML if the posted examples are literal representations of all each file needs to contain.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's perfectly clear.

What we've all been trying to tell you is there is no reliable way to split the file AFTER it has been compressed.

You must split either before or during compression.

If you split during compression you run the risk that an XML element may be split across file boundaries. If the consumers of the files are OK with that, then go ahead. That is, is their read method to decompress, then recombine, then read?
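Ray's point can be demonstrated outside of DataStage: a gzip stream chopped into fixed-size pieces is unreadable piecemeal, but concatenating all the pieces back in order restores the original. A small shell sketch, with illustrative file names:

```shell
#!/bin/sh
# Demonstrate "decompress, then recombine, then read": the individual
# pieces of a split gzip stream are not usable on their own, but the
# ordered concatenation of ALL the pieces is.
seq -s, 1 5000 > big.xml                               # stand-in for a large XML payload
gzip -c big.xml | split -b 1k - big.xml.gz.part.       # pieces: ...part.aa, ...part.ab, ...
cat big.xml.gz.part.* | gunzip -c > rebuilt.xml        # recombine in order, then decompress
cmp big.xml rebuilt.xml && echo "identical"
```

The consumers would have to do the `cat`-then-`gunzip` step themselves; reading any one piece in isolation fails.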
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
joycerecacho
Participant
Posts: 298
Joined: Tue Aug 26, 2008 12:17 pm

Post by joycerecacho »

Sorry guys, I asked whether I was clear because my English is so bad.

Obviously I couldn't split the xml file after it was compressed.
In short, I just want to know if there is a way to split it before or while generating it. Forget the compression for now.
Instead of the 50Mb limitation I could set a limit of rows, for example.
That is not the point now, not the most important part.

The thing is: "How could I solve it using DS, instead of a shell script?"
How could I split the file into smaller ones, all of them valid and with their headers, without forgetting the sequential file Id?

Best regards,
Joyce A. Recacho
São Paulo/SP
Brazil
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

It's a simple single path relational "tree" structure. Get the data into one single tuple.....header info.....customer info....contract info.....

From there you should be able to "reasonably" guess how many rows will get you close to 50 meg....and if you play games with the counts of how many customers there are per header and how many contracts, with some upstream processing, you should be able to get it very close (it might take multiple passes of the data to work out the counts and size totals).

Ultimately you are "calculating" the points at which you need to make a "break" to a new file. That "point" is where you need to (a) set up a trigger field, using counters, that can be utilized in one of the xml Stages to force a break into a new file....or (b) more easily, set up counters at "that point" to send a new end-of-wave via the end-of-wave operator. End of wave will "clean out" the xml stage and have it cut a whole new xml document.
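A hedged sketch of the "calculating the break points" step, scaled way down so it is self-checking: accumulate an estimated rendered size per row and flag the row that should start a new file (in the job, that flag is where the trigger field or end-of-wave would fire). The 20-byte per-row overhead, the 60-byte limit, and the file names are all illustrative assumptions.

```shell
#!/bin/sh
# Walk the rows, accumulate an estimated rendered size, and flag the row
# where a new file (or end-of-wave) should start. LIMIT stands in for
# the 50 Mb target, scaled down for the example.
LIMIT=60
printf 'alpha\nbeta\ngamma\ndelta\nepsilon\n' > rows.txt
awk -v limit="$LIMIT" '
  { size += length($0) + 20                               # rough per-row XML tag overhead
    if (size > limit) { size = length($0) + 20; flag = 1 }  # start a new file at this row
    else              { flag = 0 }
    printf "%d|%s\n", flag, $0 }
' rows.txt > flagged.txt
cat flagged.txt
```

In the real job the size estimate would come from the actual column lengths, and multiple passes (as Ernie notes) would tighten the guess.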

There's a lot of logic that will have to be figured out, but as Ray notes, do this proactively upstream and create the sizes you need.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You have no control over the filename other than providing the 'base' name in the stage; it numbers them from there. You will need to script something to run afterwards that renames them according to your business rules.
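Since the stage only numbers files from a base name, the rename pass could look something like the sketch below; the base name, the target pattern, and the directory are all illustrative assumptions, not the actual business rule.

```shell
#!/bin/sh
# Rename stage-numbered output files to a business naming rule.
# "extract_N.xml" and "CUSTOMER_NNN.xml" are made-up patterns.
mkdir -p out
for n in 1 2 3; do : > "out/extract_${n}.xml"; done   # stand-in for the stage's output
i=0
for f in out/extract_*.xml; do
  i=$((i + 1))
  mv "$f" "out/$(printf 'CUSTOMER_%03d.xml' "$i")"
done
ls out
```

A DataStage after-job subroutine (ExecSH) could run the same loop against the stage's real output directory.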
-craig

"You can never have too many knives" -- Logan Nine Fingers
joycerecacho
Participant
Posts: 298
Joined: Tue Aug 26, 2008 12:17 pm

Post by joycerecacho »

Thank you chulett!
I used a shell script to specify the file names.
It worked perfectly.

Thank you guys for your help.

Best regards,
Joyce A. Recacho
São Paulo/SP
Brazil