Page 1 of 1
Performance tuning with EtlStats reports
Posted: Wed Jun 21, 2006 2:20 pm
by kduke
Your ETL took longer today than normal. Your boss asks why. Was it more data being processed, network traffic or some other issue. How do you find out what it was? We had this issue this week. This is how I solved it.
What is the average number of rows for each job? This varies by weekday. So what is the average number of rows for a Tuesday?
Code: Select all
select
PROJECT_NAME,
JOB_NAME,
LINK_NAME,
max(START_TS) as MAX_START_TS,
avg(ROW_COUNT) as AVG_ROW_COUNT,
stdev(ROW_COUNT) as STD_ROW_COUNT,
avg(ROWS_PER_SEC) as AVG_ROWS_PER_SEC,
stdev(ROWS_PER_SEC) as STD_ROWS_PER_SEC
from
ETL_ROW_HIST
where
datepart(DW, getdate()) = datepart(DW, convert(datetime, START_TS))
group by
PROJECT_NAME,
JOB_NAME,
LINK_NAME
Posted: Wed Jun 21, 2006 2:29 pm
by kduke
This is SQL Server syntax. If you figure this out for Oracle or another database then please post it. getdate() is the current date. datepart(DW, getdate()) will get the day of the week for today. stddev() is the standard deviation for the current avg(). Most numbers should be within one standard deviation of the average.
Next how do you use this information?
You need to find all jobs where the number of rows was greater then the avg(number of rows) + stdev(number of rows).
Code: Select all
select
LAST_RUN.JOB_NAME,
LAST_RUN.LINK_NAME,
LAST_RUN.START_TS,
LAST_RUN.ELAPSED_TIME,
LAST_RUN.ROWS_PER_SEC,
ROWS_MAX.AVG_ROWS_PER_SEC,
convert(decimal(18,0),ROWS_MAX.STD_ROWS_PER_SEC) as STD_ROWS_PER_SEC,
LAST_RUN.ROW_COUNT,
ROWS_MAX.AVG_ROW_COUNT,
convert(decimal(18,0),ROWS_MAX.STD_ROW_COUNT) as STD_ROW_COUNT
from
ETL_ROW_HIST LAST_RUN,
(
select
PROJECT_NAME,
JOB_NAME,
LINK_NAME,
max(START_TS) as MAX_START_TS,
avg(ROW_COUNT) as AVG_ROW_COUNT,
stdev(ROW_COUNT) as STD_ROW_COUNT,
avg(ROWS_PER_SEC) as AVG_ROWS_PER_SEC,
stdev(ROWS_PER_SEC) as STD_ROWS_PER_SEC
from
ETL_ROW_HIST
where
datepart(DW, getdate()) = datepart(DW, convert(datetime, START_TS))
group by
PROJECT_NAME,
JOB_NAME,
LINK_NAME
) ROWS_MAX,
ETL_JOB
where
LAST_RUN.PROJECT_NAME = ETL_JOB.PROJECT_NAME and
LAST_RUN.JOB_NAME = ETL_JOB.JOB_NAME and
LAST_RUN.PROJECT_NAME = ROWS_MAX.PROJECT_NAME and
LAST_RUN.JOB_NAME = ROWS_MAX.JOB_NAME and
LAST_RUN.LINK_NAME = ROWS_MAX.LINK_NAME and
LAST_RUN.START_TS = ROWS_MAX.MAX_START_TS and
LAST_RUN.ROW_COUNT > (ROWS_MAX.AVG_ROW_COUNT + ROWS_MAX.STD_ROW_COUNT)
order by
ETL_JOB.JOB_CATEGORY,
LAST_RUN.JOB_NAME,
LAST_RUN.LINK_NAME,
LAST_RUN.START_TS
;
Posted: Wed Jun 21, 2006 2:34 pm
by kduke
The problem is I only want jobs which ran today. I only want jobs which ran for 5 minutes or more. What difference does it make on small jobs. I want to list the important jobs only. I also want jobs which are running slower based on rows per second. Either could be a problem.
Code: Select all
/* -----------------------------------------------------------------
MoreRowsThanNormal.sql
Decription: Returns list of Jobs.
Written by: Kim Duke
Notes:
This shows the last run of the a job compared to the average plus
one standard deviation away. The job needs to run more than 5 minutes
to be considered. We want to know if rows per second decreases or
the number of rows increases significantly.
----------------------------------------------------------------- */
select
LAST_RUN.JOB_NAME,
LAST_RUN.LINK_NAME,
LAST_RUN.START_TS,
LAST_RUN.ELAPSED_TIME,
LAST_RUN.ROWS_PER_SEC,
ROWS_MAX.AVG_ROWS_PER_SEC,
convert(decimal(18,0),ROWS_MAX.STD_ROWS_PER_SEC) as STD_ROWS_PER_SEC,
LAST_RUN.ROW_COUNT,
ROWS_MAX.AVG_ROW_COUNT,
convert(decimal(18,0),ROWS_MAX.STD_ROW_COUNT) as STD_ROW_COUNT
from
ETL_ROW_HIST LAST_RUN,
(
select
PROJECT_NAME,
JOB_NAME,
LINK_NAME,
max(START_TS) as MAX_START_TS,
avg(ROW_COUNT) as AVG_ROW_COUNT,
stdev(ROW_COUNT) as STD_ROW_COUNT,
avg(ROWS_PER_SEC) as AVG_ROWS_PER_SEC,
stdev(ROWS_PER_SEC) as STD_ROWS_PER_SEC
from
ETL_ROW_HIST
where
datepart(DW, getdate()) = datepart(DW, convert(datetime, START_TS))
group by
PROJECT_NAME,
JOB_NAME,
LINK_NAME
) ROWS_MAX,
ETL_JOB
where
LAST_RUN.START_TS > cast(convert(varchar, getdate(), 101) as datetime) and
LAST_RUN.PROJECT_NAME = ETL_JOB.PROJECT_NAME and
LAST_RUN.JOB_NAME = ETL_JOB.JOB_NAME and
LAST_RUN.PROJECT_NAME = ROWS_MAX.PROJECT_NAME and
LAST_RUN.JOB_NAME = ROWS_MAX.JOB_NAME and
LAST_RUN.LINK_NAME = ROWS_MAX.LINK_NAME and
LAST_RUN.START_TS = ROWS_MAX.MAX_START_TS and
LAST_RUN.ELAPSED_SEC > 300 and
(
LAST_RUN.ROW_COUNT > (ROWS_MAX.AVG_ROW_COUNT + ROWS_MAX.STD_ROW_COUNT)
or LAST_RUN.ROWS_PER_SEC < (ROWS_MAX.AVG_ROWS_PER_SEC - ROWS_MAX.STD_ROWS_PER_SEC)
)
order by
ETL_JOB.JOB_CATEGORY,
LAST_RUN.JOB_NAME,
LAST_RUN.LINK_NAME,
LAST_RUN.START_TS
;
Posted: Wed Jun 21, 2006 2:37 pm
by kduke
Maybe all you want to know is which sequences ran long.
Code: Select all
/* -----------------------------------------------------------------
SeqRunningLong2.sql
Decription: Returns list of Jobs.
Written by: Kim Duke
Notes:
This shows the last run of the a job compared to the average plus
one standard deviation away.
----------------------------------------------------------------- */
select
LAST_RUN.JOB_NAME,
LAST_RUN.START_TS,
max(LAST_RUN.ELAPSED_TIME) as ELAPSED_TIME,
max(LAST_RUN.ELAPSED_SEC) as ELAPSED_SEC,
max(MAX_RUN.AVG_ELAPSED_SEC) as AVG_ELAPSED_SEC,
convert(decimal(18,0),max(MAX_RUN.STD_ELAPSED_SEC)) as STD_ELAPSED_SEC
from
ETL_JOB_HIST LAST_RUN,
(
select
PROJECT_NAME,
JOB_NAME,
max(START_TS) as MAX_START_TS,
avg(ELAPSED_SEC) as AVG_ELAPSED_SEC,
stdev(ELAPSED_SEC) as STD_ELAPSED_SEC
from
ETL_JOB_HIST
where
datepart(DW, getdate()) = datepart(DW, convert(datetime, START_TS))
group by
PROJECT_NAME,
JOB_NAME
) MAX_RUN,
ETL_JOB
where
ETL_JOB.SEQ_IND = 'Y' and
LAST_RUN.START_TS > cast(convert(varchar, getdate(), 101) as datetime) and
LAST_RUN.PROJECT_NAME = ETL_JOB.PROJECT_NAME and
LAST_RUN.JOB_NAME = ETL_JOB.JOB_NAME and
LAST_RUN.PROJECT_NAME = MAX_RUN.PROJECT_NAME and
LAST_RUN.JOB_NAME = MAX_RUN.JOB_NAME and
LAST_RUN.START_TS = MAX_RUN.MAX_START_TS and
-- LAST_RUN.JOB_NAME like 'Customer%' and
LAST_RUN.ELAPSED_SEC > 300 and
LAST_RUN.ELAPSED_SEC > (MAX_RUN.AVG_ELAPSED_SEC + MAX_RUN.STD_ELAPSED_SEC)
group by
LAST_RUN.JOB_NAME,
LAST_RUN.START_TS
order by
LAST_RUN.JOB_NAME,
LAST_RUN.START_TS
;
Posted: Wed Jun 21, 2006 2:39 pm
by kduke
Should post a new version of EtlStats soon. I cleaned up a lot of stuff.
Posted: Thu Jun 22, 2006 6:54 am
by DSguru2B
Kim. You rule. Thank you so much for sharing these jewels with us.

Posted: Tue Jul 04, 2006 8:40 pm
by kduke
I posted a new EtlStats on my tips page. If you find any errors then let me know. It should be easier to install now. The installation instructions are included. I do not included the html documents or the compiled jobs. So the zip file is a lot smaller. I created categories for all the jobs which is a lot cleaner now. The same for the routines. I think there is 90 routines and not sure how many jobs.
The category ETL_Metadata has jobs which extract a lot of the job metadata like table names and other information I think is valuable. I also extract table names and column names from DB2 and SQL Server system tables. This is a poor mans version of MetaStage. I plan on adding a lot more jobs of this nature.
Posted: Sat Jul 08, 2006 9:29 am
by kduke
The new EtlStats includes a few jobs which extract ETL metadata like the table names associated with the link name. This is only for output links but this becomes very useful. You can now tie row counts to table names. Here is the SQL. You need to find the right jobs to run to populate the tables. The DDL to create the tables is in the zip file. I have all create tables for SQL Server and some for Oracle. You can figure out how to convert to your database.
Code: Select all
/* -----------------------------------------------------------------
EtlRowsByTableName.sql
Decription: Reports row counts by table, job and link name.
Written by: Kim Duke
Notes:
Need to run several jobs in EtlStats to get this information.
----------------------------------------------------------------- */
Select
ETL_ROW_HIST.JOB_NAME,
ETL_ROW_HIST.LINK_NAME,
case
when IsNull(ETL_MD_TABLE.TABLE_NAME,'') <> ''
then ETL_MD_TABLE.TABLE_NAME
else IsNull(ETL_MD_PROPERTY_TABLE.TABLE_NAME,'')
end as TABLE_NAME,
ETL_ROW_HIST.ROW_COUNT
From
ETL_ROW_HIST
Inner Join
(
select
PROJECT_NAME,
JOB_NAME,
LINK_NAME,
max(START_TS) as MAX_START_TS
from
ETL_ROW_HIST
group by
PROJECT_NAME,
JOB_NAME,
LINK_NAME
) LAST_RUN
On
LAST_RUN.PROJECT_NAME = ETL_ROW_HIST.PROJECT_NAME
and LAST_RUN.JOB_NAME = ETL_ROW_HIST.JOB_NAME
and LAST_RUN.LINK_NAME = ETL_ROW_HIST.LINK_NAME
and LAST_RUN.MAX_START_TS = ETL_ROW_HIST.START_TS
Left Outer Join
ETL_MD_TABLE
On
ETL_MD_TABLE.PROJECT_NAME = ETL_ROW_HIST.PROJECT_NAME
and ETL_MD_TABLE.JOB_NAME = ETL_ROW_HIST.JOB_NAME
and ETL_MD_TABLE.LINK_NAME = ETL_ROW_HIST.LINK_NAME
Left Outer Join
ETL_MD_PROPERTY_TABLE
On
ETL_MD_PROPERTY_TABLE.PROJECT_NAME = ETL_ROW_HIST.PROJECT_NAME
and ETL_MD_PROPERTY_TABLE.JOB_NAME = ETL_ROW_HIST.JOB_NAME
and ETL_MD_PROPERTY_TABLE.LINK_NAME = ETL_ROW_HIST.LINK_NAME
Where
case
when IsNull(ETL_MD_TABLE.TABLE_NAME,'') <> ''
then ETL_MD_TABLE.TABLE_NAME
else IsNull(ETL_MD_PROPERTY_TABLE.TABLE_NAME,'')
end <> ''
Order By
1,2,3
;