Performance tuning with EtlStats reports

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Performance tuning with EtlStats reports

Post by kduke »

Your ETL took longer today than normal. Your boss asks why. Was it more data being processed, network traffic or some other issue. How do you find out what it was? We had this issue this week. This is how I solved it.

What is the average number of rows for each job? This varies by weekday. So what is the average number of rows for a Tuesday?

Code: Select all

      select 
         PROJECT_NAME,
         JOB_NAME,
         LINK_NAME,
         max(START_TS) as MAX_START_TS,
         avg(ROW_COUNT) as AVG_ROW_COUNT,
         stdev(ROW_COUNT) as STD_ROW_COUNT,
         avg(ROWS_PER_SEC) as AVG_ROWS_PER_SEC,
         stdev(ROWS_PER_SEC) as STD_ROWS_PER_SEC
      from
         ETL_ROW_HIST
      where
         datepart(DW, getdate()) = datepart(DW, convert(datetime, START_TS))
      group by
         PROJECT_NAME,
         JOB_NAME,
         LINK_NAME
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

This is SQL Server syntax. If you figure this out for Oracle or another database then please post it. getdate() is the current date. datepart(DW, getdate()) will get the day of the week for today. stddev() is the standard deviation for the current avg(). Most numbers should be within one standard deviation of the average.

Next how do you use this information?

You need to find all jobs where the number of rows was greater then the avg(number of rows) + stdev(number of rows).

Code: Select all

select 
   LAST_RUN.JOB_NAME,
   LAST_RUN.LINK_NAME,
   LAST_RUN.START_TS,
   LAST_RUN.ELAPSED_TIME,
   LAST_RUN.ROWS_PER_SEC,
   ROWS_MAX.AVG_ROWS_PER_SEC,
   convert(decimal(18,0),ROWS_MAX.STD_ROWS_PER_SEC) as STD_ROWS_PER_SEC,
   LAST_RUN.ROW_COUNT,
   ROWS_MAX.AVG_ROW_COUNT,
   convert(decimal(18,0),ROWS_MAX.STD_ROW_COUNT) as STD_ROW_COUNT
from
   ETL_ROW_HIST LAST_RUN,
   (
      select
         PROJECT_NAME,
         JOB_NAME,
         LINK_NAME,
         max(START_TS) as MAX_START_TS,
         avg(ROW_COUNT) as AVG_ROW_COUNT,
         stdev(ROW_COUNT) as STD_ROW_COUNT,
         avg(ROWS_PER_SEC) as AVG_ROWS_PER_SEC,
         stdev(ROWS_PER_SEC) as STD_ROWS_PER_SEC
      from
         ETL_ROW_HIST
      where
         datepart(DW, getdate()) = datepart(DW, convert(datetime, START_TS))
      group by
         PROJECT_NAME,
         JOB_NAME,
         LINK_NAME
   ) ROWS_MAX,
  ETL_JOB
where
   LAST_RUN.PROJECT_NAME = ETL_JOB.PROJECT_NAME and
   LAST_RUN.JOB_NAME = ETL_JOB.JOB_NAME and
   LAST_RUN.PROJECT_NAME = ROWS_MAX.PROJECT_NAME and
   LAST_RUN.JOB_NAME = ROWS_MAX.JOB_NAME and
   LAST_RUN.LINK_NAME = ROWS_MAX.LINK_NAME and
   LAST_RUN.START_TS = ROWS_MAX.MAX_START_TS and
   LAST_RUN.ROW_COUNT > (ROWS_MAX.AVG_ROW_COUNT + ROWS_MAX.STD_ROW_COUNT)
order by
   ETL_JOB.JOB_CATEGORY,
   LAST_RUN.JOB_NAME,
   LAST_RUN.LINK_NAME,
   LAST_RUN.START_TS
;
Last edited by kduke on Wed Jun 21, 2006 2:34 pm, edited 1 time in total.
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

The problem is I only want jobs which ran today. I only want jobs which ran for 5 minutes or more. What difference does it make on small jobs. I want to list the important jobs only. I also want jobs which are running slower based on rows per second. Either could be a problem.

Code: Select all

/* ----------------------------------------------------------------- 
   MoreRowsThanNormal.sql
   Decription: Returns list of Jobs.
   Written by: Kim Duke
   Notes:
   This shows the last run of the a job compared to the average plus
   one standard deviation away. The job needs to run more than 5 minutes
   to be considered. We want to know if rows per second decreases or
   the number of rows increases significantly.
   ----------------------------------------------------------------- */
select
   LAST_RUN.JOB_NAME,
   LAST_RUN.LINK_NAME,
   LAST_RUN.START_TS,
   LAST_RUN.ELAPSED_TIME,
   LAST_RUN.ROWS_PER_SEC,
   ROWS_MAX.AVG_ROWS_PER_SEC,
   convert(decimal(18,0),ROWS_MAX.STD_ROWS_PER_SEC) as STD_ROWS_PER_SEC,
   LAST_RUN.ROW_COUNT,
   ROWS_MAX.AVG_ROW_COUNT,
   convert(decimal(18,0),ROWS_MAX.STD_ROW_COUNT) as STD_ROW_COUNT
from
   ETL_ROW_HIST LAST_RUN,
   (
      select
         PROJECT_NAME,
         JOB_NAME,
         LINK_NAME,
         max(START_TS) as MAX_START_TS,
         avg(ROW_COUNT) as AVG_ROW_COUNT,
         stdev(ROW_COUNT) as STD_ROW_COUNT,
         avg(ROWS_PER_SEC) as AVG_ROWS_PER_SEC,
         stdev(ROWS_PER_SEC) as STD_ROWS_PER_SEC
      from
         ETL_ROW_HIST
      where
         datepart(DW, getdate()) = datepart(DW, convert(datetime, START_TS))
      group by
         PROJECT_NAME,
         JOB_NAME,
         LINK_NAME
   ) ROWS_MAX,
  ETL_JOB
where
   LAST_RUN.START_TS > cast(convert(varchar, getdate(), 101) as datetime) and
   LAST_RUN.PROJECT_NAME = ETL_JOB.PROJECT_NAME and
   LAST_RUN.JOB_NAME = ETL_JOB.JOB_NAME and
   LAST_RUN.PROJECT_NAME = ROWS_MAX.PROJECT_NAME and
   LAST_RUN.JOB_NAME = ROWS_MAX.JOB_NAME and
   LAST_RUN.LINK_NAME = ROWS_MAX.LINK_NAME and
   LAST_RUN.START_TS = ROWS_MAX.MAX_START_TS and
   LAST_RUN.ELAPSED_SEC > 300 and
   (
      LAST_RUN.ROW_COUNT > (ROWS_MAX.AVG_ROW_COUNT + ROWS_MAX.STD_ROW_COUNT)
      or LAST_RUN.ROWS_PER_SEC < (ROWS_MAX.AVG_ROWS_PER_SEC - ROWS_MAX.STD_ROWS_PER_SEC)
   )
order by
   ETL_JOB.JOB_CATEGORY,
   LAST_RUN.JOB_NAME,
   LAST_RUN.LINK_NAME,
   LAST_RUN.START_TS
;
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Maybe all you want to know is which sequences ran long.

Code: Select all

/* ----------------------------------------------------------------- 
   SeqRunningLong2.sql
   Decription: Returns list of Jobs.
   Written by: Kim Duke
   Notes:
   This shows the last run of the a job compared to the average plus
   one standard deviation away.
   ----------------------------------------------------------------- */
select
   LAST_RUN.JOB_NAME,
   LAST_RUN.START_TS,
   max(LAST_RUN.ELAPSED_TIME) as ELAPSED_TIME,
   max(LAST_RUN.ELAPSED_SEC) as ELAPSED_SEC,
   max(MAX_RUN.AVG_ELAPSED_SEC) as AVG_ELAPSED_SEC,
   convert(decimal(18,0),max(MAX_RUN.STD_ELAPSED_SEC)) as STD_ELAPSED_SEC
from
   ETL_JOB_HIST LAST_RUN,
   (
      select
         PROJECT_NAME,
         JOB_NAME,
         max(START_TS) as MAX_START_TS,
         avg(ELAPSED_SEC) as AVG_ELAPSED_SEC,
         stdev(ELAPSED_SEC) as STD_ELAPSED_SEC
      from
         ETL_JOB_HIST
      where
         datepart(DW, getdate()) = datepart(DW, convert(datetime, START_TS))
      group by
         PROJECT_NAME,
         JOB_NAME
   ) MAX_RUN,
  ETL_JOB
where
   ETL_JOB.SEQ_IND = 'Y' and
   LAST_RUN.START_TS > cast(convert(varchar, getdate(), 101) as datetime) and
   LAST_RUN.PROJECT_NAME = ETL_JOB.PROJECT_NAME and
   LAST_RUN.JOB_NAME = ETL_JOB.JOB_NAME and
   LAST_RUN.PROJECT_NAME = MAX_RUN.PROJECT_NAME and
   LAST_RUN.JOB_NAME = MAX_RUN.JOB_NAME and
   LAST_RUN.START_TS = MAX_RUN.MAX_START_TS and
   -- LAST_RUN.JOB_NAME like 'Customer%' and
   LAST_RUN.ELAPSED_SEC > 300 and
   LAST_RUN.ELAPSED_SEC > (MAX_RUN.AVG_ELAPSED_SEC + MAX_RUN.STD_ELAPSED_SEC)
group by
   LAST_RUN.JOB_NAME,
   LAST_RUN.START_TS
order by
   LAST_RUN.JOB_NAME,
   LAST_RUN.START_TS
;
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Should post a new version of EtlStats soon. I cleaned up a lot of stuff.
Mamu Kim
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Kim. You rule. Thank you so much for sharing these jewels with us. :P
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I posted a new EtlStats on my tips page. If you find any errors then let me know. It should be easier to install now. The installation instructions are included. I do not included the html documents or the compiled jobs. So the zip file is a lot smaller. I created categories for all the jobs which is a lot cleaner now. The same for the routines. I think there is 90 routines and not sure how many jobs.

The category ETL_Metadata has jobs which extract a lot of the job metadata like table names and other information I think is valuable. I also extract table names and column names from DB2 and SQL Server system tables. This is a poor mans version of MetaStage. I plan on adding a lot more jobs of this nature.
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

The new EtlStats includes a few jobs which extract ETL metadata like the table names associated with the link name. This is only for output links but this becomes very useful. You can now tie row counts to table names. Here is the SQL. You need to find the right jobs to run to populate the tables. The DDL to create the tables is in the zip file. I have all create tables for SQL Server and some for Oracle. You can figure out how to convert to your database.

Code: Select all

/* ----------------------------------------------------------------- 
   EtlRowsByTableName.sql
   Decription: Reports row counts by table, job and link name.
   Written by: Kim Duke
   Notes:
   Need to run several jobs in EtlStats to get this information.
   ----------------------------------------------------------------- */
Select
   ETL_ROW_HIST.JOB_NAME,
   ETL_ROW_HIST.LINK_NAME,
   case
   when IsNull(ETL_MD_TABLE.TABLE_NAME,'') <> ''
   then ETL_MD_TABLE.TABLE_NAME
   else IsNull(ETL_MD_PROPERTY_TABLE.TABLE_NAME,'')
   end as TABLE_NAME,
   ETL_ROW_HIST.ROW_COUNT
From
   ETL_ROW_HIST
Inner Join
   (
      select
         PROJECT_NAME,
         JOB_NAME,
         LINK_NAME,
         max(START_TS) as MAX_START_TS
      from
         ETL_ROW_HIST
      group by
         PROJECT_NAME,
         JOB_NAME,
         LINK_NAME
   ) LAST_RUN
On
   LAST_RUN.PROJECT_NAME = ETL_ROW_HIST.PROJECT_NAME
   and LAST_RUN.JOB_NAME = ETL_ROW_HIST.JOB_NAME
   and LAST_RUN.LINK_NAME = ETL_ROW_HIST.LINK_NAME
   and LAST_RUN.MAX_START_TS = ETL_ROW_HIST.START_TS
Left Outer Join
   ETL_MD_TABLE
On
   ETL_MD_TABLE.PROJECT_NAME = ETL_ROW_HIST.PROJECT_NAME
   and ETL_MD_TABLE.JOB_NAME = ETL_ROW_HIST.JOB_NAME
   and ETL_MD_TABLE.LINK_NAME = ETL_ROW_HIST.LINK_NAME
Left Outer Join
   ETL_MD_PROPERTY_TABLE
On
   ETL_MD_PROPERTY_TABLE.PROJECT_NAME = ETL_ROW_HIST.PROJECT_NAME
   and ETL_MD_PROPERTY_TABLE.JOB_NAME = ETL_ROW_HIST.JOB_NAME
   and ETL_MD_PROPERTY_TABLE.LINK_NAME = ETL_ROW_HIST.LINK_NAME
Where
   case
   when IsNull(ETL_MD_TABLE.TABLE_NAME,'') <> ''
   then ETL_MD_TABLE.TABLE_NAME
   else IsNull(ETL_MD_PROPERTY_TABLE.TABLE_NAME,'')
   end <> ''
Order By
   1,2,3
;
Mamu Kim
Post Reply