collecting top 200 upc

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
pavan_test
Premium Member
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

collecting top 200 upc

Post by pavan_test »

Hello Everyone.,


i have a file ( it is a retailer file ). say costco. the file has the information about the various units sold in the different stores belong to costco across USA.
eg: the file is costco

store location: upc sold
denver 500
dallas 10000
houston 100
miami 5000
atlanta 700
california 3500

the size of the file is around 1 million records. in the file i am concerned only about the above 2 columns and i am not concerned about other data.

how do i pick up top 200 upc sold from the file. can anyone please help me in the process.

initially picking up top 200 upc can be hard coded but later on it should be a configurable variable.

thanks
pavan
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Generic solution - sort descending by upc sold. A job parameter for the threshold can then pull the first X records out of the stream afterwards via a constraint as the 'top X'... unless you need to worry about ties. In that case stage variables could be employed to check for changes and increment only when the value changes, tracking the 'rank' accordingly.
-craig

"You can never have too many knives" -- Logan Nine Fingers
pavan_test
Premium Member
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

selecting top 200 upc

Post by pavan_test »

[quote="chulett"]Generic solution - sort descending by upc sold. A job parameter for the threshold can then pull the first X records out of the stream afterwards via a constraint as the 'top X'... unless you need to worry about ties. In that case stage variables could be employed to check for changes and increment only when the value changes, tracking the 'rank' accordingly.[/quote]

Sorry, my fault, i did not phrase my question correctly ....
In the above example, let's say from denver, i have a lot of upc(items) starting from 500 and are sorted in descending order. The next (venue) is dallas and say again i have some UPC's in descending order, my goal is to pick the top 200 from every venue(dallas, denver, miami.,etc)

for eg
Venue: upc sold
denver 500
denver 300
denver 200
..
..
dallas 10000
dallas 800
dallas 600
dallas 300
..
..
..

houston 100
miami 5000
atlanta 700
california 3500

Similarly for all other venues i have a descending order of UPC's, i cannot use an external filter stage since in the stage if i pass a unix command of head -200 i will get the top 200 from the entire file, but not from every venue. My idea is to get the top 200 items(UPC) from each and every venue and the number of venues in my file will be atleast 5000...
I cannot think of a way of doing it, can you suggest anything...please

Thanks
pawan
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Pavan,

In that case, you can still use Craig's solution, but instead of just sorting on UPC alone, you need to sort on both venue column and then the UPC column.
Thanks,
Naveen
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Change your expectations. DataStage is NOT a reporting tool.

Most business intelligence tools have this kind of capability out of the box. Some databases have a RANK function.

If your client insists, resist. Tell them that ETL and reporting are different. They are wrong.

Further, as soon as you move to more than one processing node, how will you ensure that your ranks are being calculated correctly? You will have to sort/hash on the grouping columns.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply