Calculating top vaues

kumar_s · Post by **kumar_s** » Mon Mar 12, 2007 8:32 pm

There are many suggestion given with different functionalities.

pravin1581 · Post by **pravin1581** » Mon Mar 12, 2007 11:35 pm

kumar_s wrote:Bit confused again.
If you file has 100 rows with column a,b,c, do you expect first 10rows or you expect first 10 values of a,b,c. Which in case may not be the same as first 10 rows.

Absolutely I want the first 10% values and not the first 10 rows.

pravin1581 · Post by **pravin1581** » Mon Mar 12, 2007 11:40 pm

DSguru2B wrote:
pravin1581 wrote: It is top 10% of a particular column in a file.
You need to sort these values ascending, line them up in a single row, get the max value (last value), calculate the 10 percent breaking point, filter all the records equal to or less than that breaking point.
If we are talking about Top 10% on values then you have to follow the method I gave above. If its Top 10% positional values, then all you need to do is get the number of records, sort again and get the 10% of the number of records. Get all records before that breaking point.

How can the values be lined in a single row?

pravin1581 · Post by **pravin1581** » Mon Mar 12, 2007 11:44 pm

chulett wrote:
pravin1581 wrote:
chulett wrote:In Server get the record count and Sort the data descending on the value in question. Constrain your output to the 'top 10%' of the records. ...
I am facing problems in record count also as DSLINKINFo always returns 0 . How can we restrict the data to top 10 % .
Get the record count before the job runs. For UNIX or MKS 'wc -l' was suggested earlier so I didn't repost that. Pass that number in as a job parameter. After you sort, use a constraint where @OUTROWNUM <= 10% of the input rows.

How can we get the record count before running the job . My requirement is the row count after sorting the data. The @OUTROWNUM function is not returning the row count.

kumar_s · Post by **kumar_s** » Mon Mar 12, 2007 11:49 pm

pravin1581 wrote:
chulett wrote:
pravin1581 wrote: I am facing problems in record count also as DSLINKINFo always returns 0 . How can we restrict the data to top 10 % .
Get the record count before the job runs. For UNIX or MKS 'wc -l' was suggested earlier so I didn't repost that. Pass that number in as a job parameter. After you sort, use a constraint where @OUTROWNUM <= 10% of the input rows.
How can we get the record count before running the job . My requirement is the row count after sorting the data. The @OUTROWNUM function is not returning the row count.

Quoted post it self has the required information. 'wc -l' the suggested approach to find the row count before the run.
Could you pleas post some input data sample and what you expect from that.

pravin1581 · Post by **pravin1581** » Tue Mar 13, 2007 3:36 am

kumar_s wrote:
pravin1581 wrote:
chulett wrote: Get the record count before the job runs. For UNIX or MKS 'wc -l' was suggested earlier so I didn't repost that. Pass that number in as a job parameter. After you sort, use a constraint where @OUTROWNUM <= 10% of the input rows.
How can we get the record count before running the job . My requirement is the row count after sorting the data. The @OUTROWNUM function is not returning the row count.
Quoted post it self has the required information. 'wc -l' the suggested approach to find the row count before the run.
Could you pleas post some input data sample and what you expect from that.

Suppose the values are 2,4,6,8,10,10,11,11,12,13. There are 8 distinct values and I want to extract out top 10% values. The reqd value is 13.

kumar_s · Post by **kumar_s** » Tue Mar 13, 2007 4:53 am

In case if you input is 2,4,6,8,10,10,11,11,12,13,2,4,6,8,10,10,11,11,12,13 then you need to get 13,13 or 12,13?
i.e., only distinct values or just the top values found in data.
Do wc -l of you file in a ExecuteCommand activity (which will give you total record count of the file), and pass the Command.Output of the Execute.Command Activity as parameter to the job.
Read the file, and sort it. In transformer, restrict the rows using @INROWNUM <= Input.paramter/10.
But this logic will vary based on your reply to this.

kumar_s · Post by **kumar_s** » Tue Mar 13, 2007 4:54 am

It not really required to quote the whole message. I did, because, just to let you know that the answer is given in the Quote message.

DSguru2B · Post by **DSguru2B** » Tue Mar 13, 2007 7:09 am

Wow, to many nested quotes. Anywho...
How is your data coming in. Its not hard to line them up. You can do it at the unix level using awk or in datastage.
Line them up, get the percentage value to find out how many positions come under top 10% and get those many items. Its a few lines of code added on to the other median code I provided to you.

pravin1581 · Post by **pravin1581** » Tue Mar 13, 2007 10:47 pm

kumar_s wrote:In case if you input is 2,4,6,8,10,10,11,11,12,13,2,4,6,8,10,10,11,11,12,13 then you need to get 13,13 or 12,13?
i.e., only distinct values or just the top values found in data.
Do wc -l of you file in a ExecuteCommand activity (which will give you total record count of the file), and pass the Command.Output of the Execute.Command Activity as parameter to the job.
Read the file, and sort it. In transformer, restrict the rows using @INROWNUM <= Input.paramter/10.
But this logic will vary based on your reply to this.

We want only the distinct top values. So the data needs to be sorted and distinct.

kumar_s · Post by **kumar_s** » Wed Mar 14, 2007 2:14 am

You might have got the overall picture by this time. To eliminate the duplicates based on a key, you can use "uniq" command from unix, or you can sort and use Transformer stage variables to eliminate it.

DSguru2B · Post by **DSguru2B** » Wed Mar 14, 2007 6:15 am

Or load it in a hashed file, the duplicates will be eliminated by itself. Use the sql to get the particular column in sorted format, then do your count and top 10 percent calculation.

pravin1581 · Post by **pravin1581** » Wed Mar 14, 2007 7:28 am

DSguru2B wrote:Or load it in a hashed file, the duplicates will be eliminated by itself. Use the sql to get the particular column in sorted format, then do your count and top 10 percent calculation.

How can we run the SQL on a file ?

DSguru2B · Post by **DSguru2B** » Wed Mar 14, 2007 7:28 am

On hashed file. How you are doing for median in the routine.

chulett · Post by **chulett** » Wed Mar 14, 2007 7:33 am

pravin1581 wrote:How can we run the SQL on a file ?

Or simply load your file into a 'work' table in your database of choice.