Calculating top vaues

A forum for discussing DataStage<sup>®</sup> basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Calculating top vaues

Post by pravin1581 »

Hi all,

How can we extract out top 10 % value of any field in DataStage ?

thanx in advance .

Pravin
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

DataStage is not a reporting tool.

You do this in SQL. Hopefully in some variant of SQL that gives you some display functions such as RANK in its query language, and some means to restrict a query based on rank.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Smeitei
Participant
Posts: 28
Joined: Tue Jan 23, 2007 3:14 pm

Post by Smeitei »

What is your reporting tool . I think it is more of a reporting requirement. OLAP tool like MicroStrategy, BO will handle that very easily or write custom sql in DS stages to get it
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

Smeitei wrote:What is your reporting tool . I think it is more of a reporting requirement. OLAP tool like MicroStrategy, BO will handle that very easily or write custom sql in DS stages to get it
I need to extract the top 10% values from a flat file , hence a user defined SQL will be of no help.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You could have used Sample stage if you were in PX. Here you need to count the total number of rows using wc and use split command to read first 10%.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

In Server get the record count and Sort the data descending on the value in question. Constrain your output to the 'top 10%' of the records.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Of course you CAN use SQL - via an ODBC driver for text file. Just don't expect it to be fast. But for sufficiently small text files performance may be acceptable.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

If the top 10% is to be calculated from a bunch of values (column values) in a single row, then you can enhance the functionlity that's been provided in your median query.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Pravin,
Here the response given are for two different purposes.
Is it the top 10% of rows from the file or top 10% of values from each field.
Pls do clarify.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
narasimha
Charter Member
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

You can write a script to find the total number of lines in the field in question.
Calculate the 10 % from that field.
You might have to use some sed or awk for this.
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

kumar_s wrote:Hi Pravin,
Here the response given are for two different purposes.
Is it the top 10% of rows from the file or top 10% of values from each field.
Pls do clarify.
It is top 10% of a particular column in a file.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

chulett wrote:In Server get the record count and Sort the data descending on the value in question. Constrain your output to the 'top 10%' of the records. ...
I am facing problems in record count also as DSLINKINFo always returns 0 . How can we restrict the data to top 10 % .
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Bit confused again.
If you file has 100 rows with column a,b,c, do you expect first 10rows or you expect first 10 values of a,b,c. Which in case may not be the same as first 10 rows.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

pravin1581 wrote: It is top 10% of a particular column in a file.
You need to sort these values ascending, line them up in a single row, get the max value (last value), calculate the 10 percent breaking point, filter all the records equal to or less than that breaking point.
If we are talking about Top 10% on values then you have to follow the method I gave above. If its Top 10% positional values, then all you need to do is get the number of records, sort again and get the 10% of the number of records. Get all records before that breaking point.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

pravin1581 wrote:
chulett wrote:In Server get the record count and Sort the data descending on the value in question. Constrain your output to the 'top 10%' of the records. ...
I am facing problems in record count also as DSLINKINFo always returns 0 . How can we restrict the data to top 10 % .
Get the record count before the job runs. For UNIX or MKS 'wc -l' was suggested earlier so I didn't repost that. Pass that number in as a job parameter. After you sort, use a constraint where @OUTROWNUM <= 10% of the input rows.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply