How to find out the bytes moved by a DataStage job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

AshishDevassy
Participant
Posts: 32
Joined: Fri Nov 07, 2014 11:18 am

How to find out the bytes moved by a DataStage job

Post by AshishDevassy »

Hey,
I need to see how many bytes are moved by our DataStage jobs.

I tried a few environment variables, like CC_MSG_LEVEL, and some APT variables, with no luck. The job logs just show the record counts.

Is there any way?
that's weird?
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Bytes moved by a job will vary wildly with each different job design and could be meaningless (i.e. Job 1 moved 1 GB from stage A to stage B, 900 MB from stage B to stage C, 900 MB from C to D, etc.).

Bytes moved across each link may be a useful measurement, but I do not know of an environment variable you could set to learn this. I would be curious to learn if there is one.
Choose a job you love, and you will never have to work a day in your life. - Confucius
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I'm not aware of any either, other than manually calculating row sizes based on the information in Chapter 2 of the Parallel Job Developer's Guide about storage sizes of data types. And anything you put in to calculate the same would degrade the job's performance.

But, of course, any data moved around in parallel jobs uses buffers of various kinds, and buffers of other kinds when re-partitioning, so you would need to round up to the nearest whole buffer size. For example, writing even one row into a Data Set will move at least 32 KB.
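
If you did want to attempt it, the arithmetic is roughly the following - a sketch only, where the per-type sizes are placeholders for the real storage sizes in the Developer's Guide and 32 KB stands in for the default transport block size:

    import math

    # Assumed storage sizes in bytes -- placeholders only; take the real values
    # from Chapter 2 of the Parallel Job Developer's Guide for your data types.
    TYPE_SIZE = {"int32": 4, "int64": 8, "sfloat": 4, "dfloat": 8,
                 "date": 4, "time": 5, "timestamp": 9}

    def record_width(columns):
        """columns: list of (type_name, length); length only matters for strings."""
        width = 0
        for type_name, length in columns:
            if type_name in ("char", "varchar"):
                width += length                  # worst case: the declared length
            else:
                width += TYPE_SIZE[type_name]
        return width

    def link_bytes(row_count, columns, block_size=32 * 1024):
        """Round the raw estimate up to whole transport blocks."""
        raw = row_count * record_width(columns)
        return math.ceil(raw / block_size) * block_size

    # Example: 1,000,000 rows of (int32, varchar(40), timestamp) down one link
    print(link_bytes(1_000_000, [("int32", 0), ("varchar", 40), ("timestamp", 0)]))

Even as a sketch this ignores any per-record overhead, so treat the result as a lower bound at best.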
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

It would be very interesting to know "why"?

What is the requirement for having such detail at the "byte" level? ...and then further, as Ray notes, is the question about "real" bytes, accurately measured uniquely for every variable-length record? ...the internal memory representation? ...with buffers and whatever other internal overhead needs to be applied? At the link level? Aggregated across all links and Stages?

Lots of possibilities.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Perhaps because Informatica does? :wink:

I've never found it all that useful, honestly, but it is shown in their monitor. I would also be curious as to the "why" of this.
-craig

"You can never have too many knives" -- Logan Nine Fingers
AshishDevassy
Participant
Posts: 32
Joined: Fri Nov 07, 2014 11:18 am

Post by AshishDevassy »

Yeah, what Ray said would be more or less a last resort.
The reason behind this exercise is to see how much data we are moving with ETL. Management request. :)
And it had me curious too.

I generated a performance report for a job, and that had a chart showing the bytes used by the job (which doesn't look like the same thing).

I was thinking of going about it using the database, i.e. identify the records loaded last night and multiply the number of records by the bytes that each record takes.
Long and winding.
that's weird?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

AshishDevassy wrote: I was thinking of going about it using the database, i.e. identify the records loaded last night and multiply the number of records by the bytes that each record takes.
What number of bytes - the full size according to the metadata? The average record length derived from the data itself? Or the sum of the true size of each record based on the data in each row? Then there's the potential difference between the data size in flight and in storage.

Sounds like fun regardless...
-craig

"You can never have too many knives" -- Logan Nine Fingers
AshishDevassy
Participant
Posts: 32
Joined: Fri Nov 07, 2014 11:18 am

Post by AshishDevassy »

For example:
Table A - bytes per record: 100
Number of records loaded last night: 500
Bytes loaded: 50,000

That's the best that I can think of...
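
In other words, nothing more sophisticated than this (table names and figures below are made up, just to show the arithmetic):

    # Made-up tables and sizes, purely to show the calculation:
    # bytes per record x records loaded, summed over the tables loaded last night.
    tables = {
        "TABLE_A": {"bytes_per_record": 100, "rows_loaded": 500},
        "TABLE_B": {"bytes_per_record": 250, "rows_loaded": 12000},
    }

    total = 0
    for name, t in tables.items():
        loaded = t["bytes_per_record"] * t["rows_loaded"]
        total += loaded
        print(name, loaded, "bytes")
    print("Total bytes loaded:", total)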
that's weird?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sure... still just wondering where that 'bytes per record' value came from.
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

An alternative outside of DataStage would be to ask a network person to monitor inbound and outbound network traffic to and from your DataStage server over an average 24-hour period and send you a report with the number of bytes in each direction. There are probably operating system commands that would let you do the same thing yourself.
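
For instance, on a Linux server something along these lines would difference the kernel's interface byte counters over a window - just a sketch, assuming /proc/net/dev is available, and bear in mind it counts all traffic on the box, not only job data:

    import time

    def read_counters(path="/proc/net/dev"):
        """Sum received/transmitted byte counters over all interfaces except loopback."""
        totals = {"rx": 0, "tx": 0}
        with open(path) as f:
            for line in f.readlines()[2:]:       # first two lines are headers
                iface, data = line.split(":", 1)
                if iface.strip() == "lo":        # skip loopback
                    continue
                fields = data.split()
                totals["rx"] += int(fields[0])   # bytes received
                totals["tx"] += int(fields[8])   # bytes transmitted
        return totals

    start = read_counters()
    time.sleep(24 * 60 * 60)                     # or snapshot via cron at start/end of day
    end = read_counters()
    print("Inbound bytes: ", end["rx"] - start["rx"])
    print("Outbound bytes:", end["tx"] - start["tx"])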
Choose a job you love, and you will never have to work a day in your life. - Confucius
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Network traffic is not all data; it includes security and SQL statements passed back and forth. Bytes written is the amount of work being done, so it is a better predictor of performance than rows per second.

EtlStats includes a routine which tries to estimate the average row length. Most databases will calculate this when you run statistics and store it in the information schema. Once you have this length you still need to tie it to the link in the job which writes to this table. Getting the table name, job name and link name into a table, the average row length into another, and then joining all that with row counts will give you the number. Not a simple task. How much benefit will you get from knowing these numbers? Will it pay for the effort of tying row length to job link name?

The EtlStats routine uses an estimate for varchar fields and other fields where the length is either a maximum length or a packed field, like numbers. For instance, DECIMAL(18,2) does not take up 18 bytes; it is packed based on how the database works. VARCHAR(40) is based on average length, so in the routine you could hard-code a percentage so all varchars would average, say, 50% or 70% of their maximum length. So at best it was an estimate. The average itself is only an estimate of how long a row is, but it is probably a better estimate than anything we could guess, so getting these figures out of the information schema is probably best.

Oracle and DB2, I know, store an average row length. How each of them gets it you would need to look up, but I think it is possible. Let us know if you do it and how you did it; it would be nice to see whether it was worth it or what you learned. Bytes per second should be consistent across all similar tables in a given database, and a faster server should improve these numbers, so load times should be easier to predict in theory.
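
If you do try it, the shape of the calculation is roughly this - the catalog view and column names are from memory, so verify them against your database, and the per-table row counts have to come from your own job link row count capture (EtlStats or similar):

    # Catalog queries -- names from memory, verify against your database.
    # Placeholder syntax differs by driver (Oracle drivers use :1, DB2 uses ?).
    ORACLE_AVG_LEN_SQL = "SELECT table_name, avg_row_len FROM all_tables WHERE owner = :1"
    DB2_AVG_LEN_SQL = "SELECT tabname, avgrowsize FROM syscat.tables WHERE tabschema = ?"

    def bytes_loaded(cursor, avg_len_sql, schema, rows_loaded_by_table):
        """cursor: any DB-API cursor; rows_loaded_by_table: {table_name: rows}
        taken from your own job-link row-count capture."""
        cursor.execute(avg_len_sql, (schema,))
        avg_len = {name.upper(): int(length or 0) for name, length in cursor.fetchall()}
        return sum(rows * avg_len.get(table.upper(), 0)
                   for table, rows in rows_loaded_by_table.items())

Pass it a cursor from whichever driver you use and the schema you load into; the average row length will only be there after statistics have been gathered.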

Good luck.
Mamu Kim
AshishDevassy
Participant
Posts: 32
Joined: Fri Nov 07, 2014 11:18 am

Post by AshishDevassy »

So I talked to the DBAs.
They said they will be able to give me the table names and the bytes per record (using something called DCLGENs).
Multiplying that by the count of records loaded by the ETL would give a rough estimate of the data bytes moved by ETL nightly.
that's weird?
Cranie
Premium Member
Posts: 19
Joined: Fri Mar 03, 2006 6:02 pm
Location: Sheffield

Post by Cranie »

This is very possible.

We have a process that captures the DS logs, design and run information. It runs every few hours. The job is DS BASIC and captures 1500 job instance runs in about 10 minutes (so it's pretty quick and not resource hungry).

We can then extrapolate details such as how much data (bytes, rows) has gone down each link, even down to the level of how much has gone down each partition. This is very useful in identifying skewed PX jobs and jobs with high CPU but low row counts, and it gives you a first glance at what could be a focus point.
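
As an illustration, the skew check on those per-partition counts is nothing more than this (shown in Python just for the shape of it - our actual capture job is DS BASIC - with made-up counts):

    def skew_ratio(partition_rows):
        """partition_rows: row counts for one link, one entry per partition.
        Returns how much of its fair share the heaviest partition carries (1.0 = balanced)."""
        total = sum(partition_rows)
        if total == 0:
            return 0.0
        expected = total / len(partition_rows)
        return max(partition_rows) / expected

    # Made-up per-partition counts for a 4-way link
    counts = [250_000, 248_000, 251_000, 900_000]
    ratio = skew_ratio(counts)
    if ratio > 1.5:
        print("Skewed link: heaviest partition carries %.1fx its fair share" % ratio)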

Most of these features are detailed in the BASIC documentation. I can see many benefits to having this.

Some of it needs to be guesswork - i.e. data is stored and compressed in some instances, and varchar fields may not always be 255 (we take an average). So while not an exact process, it is a very good indicator and gives you a sense of what the environment is doing. This is great for capacity planning too.
- - - - -

Cranie

Doing the needful.
AshishDevassy
Participant
Posts: 32
Joined: Fri Nov 07, 2014 11:18 am

Post by AshishDevassy »

@Cranie - can you give some more details of the DS job that you use to capture this information?
that's weird?