Calculating top vaues

A forum for discussing DataStage<sup>®</sup> basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

There are many suggestion given with different functionalities. :roll:
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

kumar_s wrote:Bit confused again.
If you file has 100 rows with column a,b,c, do you expect first 10rows or you expect first 10 values of a,b,c. Which in case may not be the same as first 10 rows.
Absolutely I want the first 10% values and not the first 10 rows.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

DSguru2B wrote:
pravin1581 wrote: It is top 10% of a particular column in a file.
You need to sort these values ascending, line them up in a single row, get the max value (last value), calculate the 10 percent breaking point, filter all the records equal to or less than that breaking point.
If we are talking about Top 10% on values then you have to follow the method I gave above. If its Top 10% positional values, then all you need to do is get the number of records, sort again and get the 10% of the number of records. Get all records before that breaking point.
How can the values be lined in a single row?
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

chulett wrote:
pravin1581 wrote:
chulett wrote:In Server get the record count and Sort the data descending on the value in question. Constrain your output to the 'top 10%' of the records. ...
I am facing problems in record count also as DSLINKINFo always returns 0 . How can we restrict the data to top 10 % .
Get the record count before the job runs. For UNIX or MKS 'wc -l' was suggested earlier so I didn't repost that. Pass that number in as a job parameter. After you sort, use a constraint where @OUTROWNUM <= 10% of the input rows.
How can we get the record count before running the job . My requirement is the row count after sorting the data. The @OUTROWNUM function is not returning the row count.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

pravin1581 wrote:
chulett wrote:
pravin1581 wrote: I am facing problems in record count also as DSLINKINFo always returns 0 . How can we restrict the data to top 10 % .
Get the record count before the job runs. For UNIX or MKS 'wc -l' was suggested earlier so I didn't repost that. Pass that number in as a job parameter. After you sort, use a constraint where @OUTROWNUM <= 10% of the input rows.
How can we get the record count before running the job . My requirement is the row count after sorting the data. The @OUTROWNUM function is not returning the row count.
Quoted post it self has the required information. 'wc -l' the suggested approach to find the row count before the run.
Could you pleas post some input data sample and what you expect from that.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

kumar_s wrote:
pravin1581 wrote:
chulett wrote: Get the record count before the job runs. For UNIX or MKS 'wc -l' was suggested earlier so I didn't repost that. Pass that number in as a job parameter. After you sort, use a constraint where @OUTROWNUM <= 10% of the input rows.
How can we get the record count before running the job . My requirement is the row count after sorting the data. The @OUTROWNUM function is not returning the row count.
Quoted post it self has the required information. 'wc -l' the suggested approach to find the row count before the run.
Could you pleas post some input data sample and what you expect from that.
Suppose the values are 2,4,6,8,10,10,11,11,12,13. There are 8 distinct values and I want to extract out top 10% values. The reqd value is 13.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

In case if you input is 2,4,6,8,10,10,11,11,12,13,2,4,6,8,10,10,11,11,12,13 then you need to get 13,13 or 12,13?
i.e., only distinct values or just the top values found in data.
Do wc -l of you file in a ExecuteCommand activity (which will give you total record count of the file), and pass the Command.Output of the Execute.Command Activity as parameter to the job.
Read the file, and sort it. In transformer, restrict the rows using @INROWNUM <= Input.paramter/10.
But this logic will vary based on your reply to this.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

It not really required to quote the whole message. I did, because, just to let you know that the answer is given in the Quote message.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Wow, to many nested quotes. Anywho...
How is your data coming in. Its not hard to line them up. You can do it at the unix level using awk or in datastage.
Line them up, get the percentage value to find out how many positions come under top 10% and get those many items. Its a few lines of code added on to the other median code I provided to you.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

kumar_s wrote:In case if you input is 2,4,6,8,10,10,11,11,12,13,2,4,6,8,10,10,11,11,12,13 then you need to get 13,13 or 12,13?
i.e., only distinct values or just the top values found in data.
Do wc -l of you file in a ExecuteCommand activity (which will give you total record count of the file), and pass the Command.Output of the Execute.Command Activity as parameter to the job.
Read the file, and sort it. In transformer, restrict the rows using @INROWNUM <= Input.paramter/10.
But this logic will vary based on your reply to this.
We want only the distinct top values. So the data needs to be sorted and distinct.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You might have got the overall picture by this time. To eliminate the duplicates based on a key, you can use "uniq" command from unix, or you can sort and use Transformer stage variables to eliminate it.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Or load it in a hashed file, the duplicates will be eliminated by itself. Use the sql to get the particular column in sorted format, then do your count and top 10 percent calculation.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

DSguru2B wrote:Or load it in a hashed file, the duplicates will be eliminated by itself. Use the sql to get the particular column in sorted format, then do your count and top 10 percent calculation.
How can we run the SQL on a file ?
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

On hashed file. How you are doing for median in the routine.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

pravin1581 wrote:How can we run the SQL on a file ?
Or simply load your file into a 'work' table in your database of choice.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply