How to find out the bytes moved by a datastage job
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 32
- Joined: Fri Nov 07, 2014 11:18 am
How to find out the bytes moved by a datastage job
Hey,
I need to see how many bytes are moved by the DataStage jobs.
I tried a few environment variables like CC_MSG_LEVEL and some APT variables, with no luck.
The job logs just show the record counts.
Is there any way?
That's weird?
Bytes moved by a job will vary wildly with each different job design and could be meaningless (i.e. Job 1 moved 1 GB from stage A to stage B, 900 MB from stage B to stage C, 900 MB from C to D, etc.).
Bytes moved across each link may be a useful measurement, but I do not know of an environment variable that reveals it. I would be curious to learn if there is one.
Choose a job you love, and you will never have to work a day in your life. - Confucius
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
I'm not aware of any either, other than manually calculating row sizes based on the information in Chapter 2 of the Parallel Job Developer's Guide about storage sizes of data types. And anything you put in to calculate the same would degrade the job's performance.
But, of course, any data moved around in parallel jobs uses buffers of various kinds, and of other kinds when re-partitioning, so you would need to round up to the nearest whole buffer size. For example, writing even one row into a Data Set will move at least 32 KB.
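The manual calculation Ray describes can be sketched in a few lines. This is an illustrative estimate only: the per-type byte sizes and the packed-decimal formula below are assumptions standing in for the authoritative storage sizes in Chapter 2 of the Parallel Job Developer's Guide, and the `link` column list is made up.

```python
# Rough per-row byte estimate from a link's column metadata.
# The byte sizes below are illustrative assumptions -- check Chapter 2 of
# the Parallel Job Developer's Guide for the authoritative storage sizes.
TYPE_BYTES = {
    "integer": 4,
    "bigint": 8,
    "sfloat": 4,
    "dfloat": 8,
    "date": 4,
    "time": 4,
    "timestamp": 8,
}

def row_bytes(columns):
    """Estimate bytes per row for a list of (name, type, length) tuples.

    char(n)/varchar(n) count n bytes, so this is a worst-case figure;
    decimal(p,s) is approximated as p//2 + 1 packed bytes.
    """
    total = 0
    for _name, dtype, length in columns:
        if dtype in ("char", "varchar"):
            total += length
        elif dtype == "decimal":
            total += length // 2 + 1          # packed-decimal approximation
        else:
            total += TYPE_BYTES[dtype]
    return total

link = [("cust_id", "integer", None),
        ("name", "varchar", 40),
        ("balance", "decimal", 18),
        ("updated", "timestamp", None)]
print(row_bytes(link) * 1_000_000)  # bytes for one million rows
```

Multiply the per-row figure by the link's row count from the job log, then round up per Ray's buffer caveat.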
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
It would be very interesting to know "why"?
What is the requirement for having such detail at the "byte" level? ...and then further, as Ray notes, is the question about "real" bytes, accurately measured uniquely for every variable record? ...internal memory representation? ...with buffers and whatever else internal needs to be applied? At the link level? Aggregated across all links and Stages?
Lots of possibilities.
Ernie
Ernie Ostic
blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
-
- Participant
- Posts: 32
- Joined: Fri Nov 07, 2014 11:18 am
Yeah, what Ray said was more or less a last resort.
The reason behind this exercise is to see the amount of data we are moving using ETL - a management request.
And it had me curious too.
I generated a performance report for a job, and that had a chart showing bytes used by the job (doesn't look like the same thing).
I was thinking of going about it using the database, i.e. identify the records loaded last night and multiply the number of records by the bytes each record takes.
Long and winding.
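The "records times bytes per record" idea reduces to plain arithmetic. The catalog column named below (Oracle's USER_TABLES.AVG_ROW_LEN) is real, but the table names and nightly row counts are hypothetical stand-ins for whatever audit or row-count source you have:

```python
# Sketch of the "rows x bytes per row" approach. The nightly row counts
# would come from your own audit/ETL-stats tables; the average row lengths
# would come from the database catalog (e.g. USER_TABLES.AVG_ROW_LEN on
# Oracle, populated when statistics are gathered).
nightly_rows = {"ORDERS": 1_200_000, "CUSTOMERS": 85_000}   # rows loaded last night
avg_row_len = {"ORDERS": 145, "CUSTOMERS": 310}             # bytes, from catalog stats

total_bytes = sum(nightly_rows[t] * avg_row_len[t] for t in nightly_rows)
print(f"{total_bytes / 1024**2:.1f} MB moved")
```

As Craig points out next, which "bytes per record" you pick (metadata maximum, catalog average, or true per-row size) changes the answer considerably.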
AshishDevassy wrote: I was thinking of going about it using the database, i.e. identify the records loaded last night (multiply the no of records with the bytes that each record takes.)

What number of bytes - the full size according to the metadata? The Average Record Length derived from the data itself? Or by summing the true size of each record based on the data in each row? Then there's the potential difference of the data size in flight versus in storage.
Sounds like fun regardless...
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 32
- Joined: Fri Nov 07, 2014 11:18 am
An alternative outside of DataStage would be to ask a network person to monitor inbound and outbound network traffic to and from your DataStage server over an average 24-hour period and send you a report with the number of bytes for each. There are probably operating system commands that would let you do the same thing yourself, too.
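On Linux, one way to do that yourself is to diff the kernel's per-interface byte counters before and after the window. A minimal sketch, assuming a Linux server where /proc/net/dev is available:

```python
# Diff /proc/net/dev byte counters at the start and end of the measurement
# window. Note this counts ALL traffic (protocol overhead, SQL text, etc.),
# not just job data.
def net_bytes():
    """Return (rx_bytes, tx_bytes) totals across non-loopback interfaces."""
    rx = tx = 0
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:        # first two lines are headers
            if ":" not in line:
                continue
            iface, data = line.split(":", 1)
            if iface.strip() == "lo":         # skip loopback
                continue
            fields = data.split()
            rx += int(fields[0])              # bytes received
            tx += int(fields[8])              # bytes transmitted
    return rx, tx

start_rx, start_tx = net_bytes()
# ... wait out the 24-hour window, then:
# end_rx, end_tx = net_bytes()
# print(end_rx - start_rx, "bytes in,", end_tx - start_tx, "bytes out")
```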
Choose a job you love, and you will never have to work a day in your life. - Confucius
Network traffic is not all data; it includes security exchanges and SQL statements passed back and forth. Bytes written is the amount of work being done, so it is a better predictor of performance than rows per second.
EtlStats includes a routine that tries to estimate the average row length. Most databases will calculate this when you run statistics and store it in the information schema. Once you have this length, you still need to tie it to the link in the job that writes to the table. Getting table name, job name and link name into a table, then average row length, then joining all of that with row counts will give you the number. Not a simple task. How much benefit will you get from knowing these numbers? Will it pay for the effort of tying row length to job link name?
The EtlStats routine uses an estimate for varchar fields and other fields where the length is either a maximum or packed, like numbers. For instance, DECIMAL(18,2) does not take up 18 bytes; it is packed based on how the database works. VARCHAR(40) is based on average length, so in the routine you could hard-code a percentage so that all varchars average 50% or 70% of max length. So at best it was an estimate. The average itself is an estimate of how long a row is, but it is probably better than anything we could guess, so getting these numbers out of the information schema is probably best.
Oracle and DB2, I know, have average row length; how each derives it you would need to look up. So I think it is possible. Let us know if you do it and how you did it - it would be nice to see whether it was worth it or what you learned. Bytes per second should be consistent across all similar tables in a given database, and a faster server should improve these numbers, so load times should be easier to predict, in theory.
Good luck.
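Kim's fill-factor idea can be sketched as follows. The 60% varchar fill and the packed-decimal formula are assumptions to tune against your own data, per the 50-70% suggestion above:

```python
# Sketch of the fill-factor estimate: treat VARCHAR(n) as a fraction of n
# and DECIMAL(p,s) as packed bytes, as the EtlStats routine is described as
# doing. The 0.6 fill factor and p//2 + 1 formula are tunable assumptions.
VARCHAR_FILL = 0.6

def est_col_bytes(dtype, precision):
    if dtype == "varchar":
        return precision * VARCHAR_FILL       # average, not max, length
    if dtype == "decimal":
        return precision // 2 + 1             # packed: DECIMAL(18,2) ~ 10 bytes, not 18
    return precision                          # char(n) and other fixed widths

cols = [("varchar", 40), ("decimal", 18), ("char", 10)]
print(sum(est_col_bytes(t, p) for t, p in cols))   # -> 44.0 estimated bytes/row
```

If the database's information schema already holds an average row length, prefer that over any hand-tuned fill factor.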
Mamu Kim
-
- Participant
- Posts: 32
- Joined: Fri Nov 07, 2014 11:18 am
This is very possible.
We have a process that captures the DS logs, design and run information. It runs every few hours; the job is DataStage BASIC and captures 1500 job-instance runs in about 10 minutes, so it's pretty quick and not resource hungry.
We can then extrapolate details such as how much data (bytes, rows) has gone down each link, even down to how much has gone down each partition. This is very useful for identifying skewed PX jobs, or jobs with high CPU and low row counts, and gives you a first glance at what could be a focus point.
Most of these features are detailed in the BASIC documentation. I can see many benefits to having this.
Some of it needs to be guesswork - data is stored compressed in some instances, and varchar fields may not always be 255 characters (we take an average). So while not an exact process, it is a very good indicator and gives you a sense of what the environment is doing. This is great for capacity planning too.
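The skew check described above can be sketched once per-partition row counts for a link have been parsed out of the logs. The 1.5x threshold is an arbitrary assumption; tune it to your environment:

```python
# Flag partitions that carry far more than their even share of a link's
# rows -- the symptom of a skewed PX job. Input is a list of row counts,
# one per partition, as parsed from the job's run information.
def skewed(partition_rows, threshold=1.5):
    even_share = sum(partition_rows) / len(partition_rows)
    return [i for i, rows in enumerate(partition_rows)
            if rows > threshold * even_share]

# Four-way parallel job where partition 2 is doing most of the work:
print(skewed([250_000, 240_000, 900_000, 260_000]))  # -> [2]
```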
- - - - -
Cranie
Doing the needful.