The Aggregator Stage and sorting
In the actual Sort stage, you can specify where it puts the temporary working files (soa*) that it needs. I'm wondering how that works with the Aggregator when it needs to sort data as well as group it.
There is no corresponding option in the stage to control where it puts things. I just want to make sure they don't end up in my project directory by default - if there is any way to control them... an environment variable, perhaps? Or is this something that is somehow not an issue with the Aggregator?
-craig
"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
- Participant
- Posts: 54595
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
The Aggregator stage doesn't sort data.
It can use pre-sorted data, in order to apply a much more efficient aggregation paradigm, but does not do any sorting itself. That the data sometimes appear to come out sorted is an artifact of the aggregation algorithm, and ought not to be relied upon (just like GROUP BY in SQL).
As far as I know, the Aggregator is set up to work entirely in memory. Initial and incremental allocation of memory can be tuned using option 6 from the DS.TOOLS menu. I would surmise, therefore, that the only time data from the Aggregator touches down to disk, it's in ordinary O/S swap space.
Could be interesting to post this question on ADN, where an Ascential engineer may be able to provide facts rather than suppositions.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thanks for the clarification. I guess I should have more properly said 'grouping' rather than 'sorting', but the answer seems to be that it all happens in memory rather than using any temporary files like the Sort stage does. So, this leads me to a performance question...
When faced with a situation where large volumes of data cannot be brought into a job in the proper order to support the Aggregation you need, is it a 'better practice' to actually sort it first (say, using the Sort stage) or just take it in and let the Aggregator do its thing? For some silly reason, I'd much rather see data streaming through the Agg when it can rather than it being a total 'bottleneck' in the job.
I know 'large volumes' is subjective, but just wondering what people do or what experience they've had with this issue. Thanks!
-craig
"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
If the incoming data are not sorted, or even if they are but you don't tell the Aggregator stage, then the Aggregator stage must keep the progressive set function results (sum, count, etc.) in little buckets in memory, one bucket for each combination of the grouping columns. The total size in memory is a function of the number of grouping columns, the number of aggregation fields, and the data type - and therefore size - of each.
If the incoming data are sorted (by grouping column(s)) and you assert this on the Aggregator's input link, then far less memory is consumed and greater speed obtained.
This is because, whenever a value changes on one of the sorted columns, the previous value will never recur, so the buckets used for that value (and anything sorted later) can be pushed out the output link and that memory freed for re-use. Indeed, only one bucket is needed for the column sorted first.
Normal "idiot proofing" means the code must check that no row is out of order in a grouping column asserted to be sorted, because the more efficient algorithm would generate incorrect results under those circumstances.
The above discussion ought, with a bit of thought, to indicate why an Aggregator stage is intolerant of NULL.
The best way to understand the Aggregator stage is to do a bit of role playing. Get a whiteboard, or a large piece of paper, and actually pretend that you are the Aggregator stage doing its thing. For each row, you need a bucket containing each grouping column and each aggregating column. If you have no a priori knowledge about the data, you must keep adding buckets as new combinations of grouping column values arrive. If you do have knowledge that a particular grouping column value will not arrive again, then you can record the results from those buckets, then wipe them off and re-use the whiteboard space.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Wow... I hope all of that was for the home audience, 'cuz I really do understand how the dreaded thing works. Well, except for the 'in memory versus temporary file storage' bit which started this all off.
I know the whole 'buckets' thing and how it can (in essence) use one bucket when the input data is sorted in a manner that supports the aggregation and you assert that in the stage. My question still stands, though...
For argument's sake, let's assume I can't have the O/S or DB or something like CoSort do the sorting - I can't arrange for the data to arrive in the order that I need. All I have to work with are DataStage stages. Given all of the impressions out there of the relative slowness of the Sort stage, I'm wondering which is the more 'performant' option - presorting the data using the Sort stage, then feeding the Aggregator (with all of the proper assertions asserted, of course) or simply allowing the Aggregator to go hog wild and start its whole bucket brigade thing?
I generally stick with a stand-alone Aggregator for small data volumes, and I've used it (seemingly) just fine for several million 'unsorted' rows. I know there are numerous factors involved, just wondering if there is a GAROT here somewhere - a Generally Accepted Rule of Thumb.
Thanks!
-craig
"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
How about a "generally accepted rule of thumb to everyone"?
There's none that I know of. Theoretically the only upper limit is reached when the Aggregator stage runs out of allocable memory, but when that happens depends on the individual operating system and on what else is demanding memory at the time.
Most of the abort situations I've seen with the Aggregator have been to do with the inability to promote data types (for example, it doesn't need too many smallints in a SUM to need an integer to store the result), someone not getting the metadata right (causing an access violation (SIGBUS) to be raised), and attempting to perform a set operation on NULL (oh dear).
Sometimes the grouping column is the result of a lookup, which you wouldn't have until the lookup had been performed. As you suggest, not everyone has the CoSort plug-in stage, so where to from here? The choices seem to be to use the DataStage Sort stage (not fast, and consumes memory), or to drive the Aggregator stage with unsorted data.
I'd opt for the latter, possibly tuning memory allocation through DS.TOOLS if I knew that the data volumes would be large. And I would thoroughly test it before putting it in production, possibly with a double-normal sized data set.
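The data-type promotion problem mentioned above is easy to reproduce. A hedged Python illustration, using ctypes to mimic a fixed-width smallint result column (purely illustrative, not DataStage behaviour):

```python
# Two values that each fit comfortably in a 16-bit smallint, whose SUM
# does not: the true total needs a 32-bit integer, and forcing it back
# into 16 bits silently wraps around.
from ctypes import c_int16

values = [30_000, 10_000]

true_sum = sum(values)                 # 40000 -- exceeds the smallint max of 32767
narrowed = c_int16(true_sum).value     # what a smallint result column would hold

print(true_sum)   # 40000
print(narrowed)   # -25536 (wrapped: 40000 - 65536)
```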
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
