Issue with slow job startup times

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

j.press
Premium Member
Posts: 8
Joined: Tue Apr 17, 2012 3:11 pm

Issue with slow job startup times

Post by j.press »

This hasn't always been the case, but we now have some jobs in our production environment with poor startup times, and only when our batch cycles are running. When there is zero load on the box, the jobs run perfectly fine. I don't know whether it has to do with resources or some other issue like disk I/O, temp space, etc.

Here's an example of one: main_program: Startup time, 30:28; production run time, 0:57.

In this job, for testing purposes now, I have set $APT_PM_PLAYER_TIMING = True, and $APT_STARTUP_STATUS = True

The job started at 2:16 AM. Within 2 seconds, the log shows all of the database queries it prepared, the job score step, and a few warnings. Then the job hangs for 22 minutes. Only after those 22 minutes does it start logging the startup detail enabled by the environment variables above, where it reports defining section leaders and such.

At the end of the job, check this out:

02:39:25
main_program: Releasing Section Leaders; parallel step time was 57.307 seconds.

02:47:48
main_program: Returning step status of APT_StatusOk.

When the job is seemingly done, it hangs for another 8 minutes, which apparently gets factored into the startup time.

While the job was hanging during the 22 minute period I mentioned, I checked what was going on on the box.

There were approximately 130 temp files, 960 &PH& files, 120 datasets, and 530 dataset data files, and the box was ranging between 5-10% CPU usage. The job log is fairly small, only about 5 days' worth. None of this seems to me like it would cause such a terrible startup time.

Anyone have any thoughts? We are on DS 8.7 on Linux by the way.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Have you cleared your &PH& directory for the project recently? It contains an entry for every job run in the project, and DataStage never deletes them. After several years you can have hundreds of thousands of entries in the directory, which can cause severe startup slowdowns.
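A quick way to check and trim the directory from the shell. The real path would be under your project directory (e.g. `.../Server/Projects/MYPROJ/'&PH&'` -- an assumption; substitute your own, and note the `&` characters must be quoted); the sketch below simulates one so the commands can be tried safely:

```shell
# Simulate a project &PH& directory so the commands can be tried safely;
# in practice PH_DIR would be something like
# /opt/IBM/InformationServer/Server/Projects/MYPROJ/'&PH&' (assumption).
PH_DIR="$(mktemp -d)/&PH&"
mkdir -p "$PH_DIR"
touch "$PH_DIR/DSD.RUN_1" "$PH_DIR/DSD.RUN_2"

# How many entries have piled up? Counts in the hundreds of thousands
# make directory scans at job startup very slow.
COUNT=$(ls "$PH_DIR" | wc -l)
echo "entries: $COUNT"

# Purge entries older than 7 days (run only while no jobs are active).
find "$PH_DIR" -type f -mtime +7 -exec rm -f {} +
```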

Also, remember that any Before Job actions count as "startup".
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
j.press
Premium Member
Posts: 8
Joined: Tue Apr 17, 2012 3:11 pm

Post by j.press »

See my post above: we have <1000 &PH& files. If that's still too many, let me know. I believe we delete all files older than 7 days.

Also, not every job in the project is affected. Plenty run with <30-second startup times. But I imagine it could vary depending on what's happening on the server at the time.

We have no before-job or after-job subroutines if that's what you are referring to.
PaulVL
Premium Member
Posts: 1309
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

What are your settings for RTLogging and ORLogging in your project's DSParams?
j.press
Premium Member
Posts: 8
Joined: Tue Apr 17, 2012 3:11 pm

Post by j.press »

We do not have this parameter set in our DSParams file.

My understanding (correct me if I am wrong) is that these parameters determine whether logging goes to the RT_LOGxxx files or the XMETA repository. I know logging goes to the RT_LOGxxx files, as I can see them grow as the log grows (the largest is 26 MB, but most are <2 MB).
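One way to spot oversized per-job log tables on disk is to size the RT_LOG directories under the project. The project path is an assumption (substitute your own); the sketch simulates one directory so it can be tried safely:

```shell
# Simulate a project directory with one per-job log table (RT_LOGnnn);
# in practice PROJ would be your project path, e.g.
# /opt/IBM/InformationServer/Server/Projects/MYPROJ (assumption).
PROJ=$(mktemp -d)
mkdir -p "$PROJ/RT_LOG123"
dd if=/dev/zero of="$PROJ/RT_LOG123/DATA.30" bs=1024 count=64 2>/dev/null

# List log tables largest-first; oversized ones are candidates for clearing.
du -sk "$PROJ"/RT_LOG* | sort -rn
```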
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Basically correct: they control whether the log entries go to the associated repository or not. Technically, both could be set to True and the logs could be going to both locations. I'm surprised that neither is present in your DSParams file; I wonder if the 'default' is True when the setting cannot be found? Or have they been moved elsewhere? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
j.press
Premium Member
Posts: 8
Joined: Tue Apr 17, 2012 3:11 pm

Post by j.press »

Yeah, I can't answer that for sure. I spoke with the admin team, and they claimed that no logging was done to XMETA. I don't have direct access to verify it myself.
PaulVL
Premium Member
Posts: 1309
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Remember that the entry in the log is NOT the time that the event happened. It is the time that the event was written to the log. There is a difference.

Is your environment GRID enabled? Are you waiting to be released from the queue?

Can you describe what your job does? Is it connecting to a database and the resultset is being built during that startup time, and you only see the next log event when Row #1 gets returned from the connector stage...
j.press
Premium Member
Posts: 8
Joined: Tue Apr 17, 2012 3:11 pm

Post by j.press »

The environment is not on a grid.

This particular job does have a lot going on: quite a few DB connections and lookups. But I've also seen the problem with plenty of smaller jobs, even one that only connected to a database and wrote a table to a dataset. In a scenario like that, I wouldn't think any such action occurs during the startup time. I'd assume the startup status timing is logged fairly accurately, but I suppose that's still a guess.

I would love to try to prove out the actual timing of what's going on in the job, but I'm not sure how. Perhaps I could set APT_PM_SHOW_PIDS and have an external script loop, writing the process list to some sort of log, then compare the timings afterwards.
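A minimal sketch of that external-script idea: snapshot the PX processes at intervals with a timestamp, so their appearance times can be compared with the job log afterwards. Matching on `osh` (the PX engine binary) is an assumption; adjust the pattern and the loop duration for your install:

```shell
# Snapshot running PX processes every second, timestamped, so process
# start times can be lined up against the job log afterwards.
LOG=$(mktemp)
for i in 1 2 3; do                     # in practice, loop for the life of the job
    date '+%H:%M:%S' >> "$LOG"
    ps -ef | grep '[o]sh' >> "$LOG"    # the [o] trick excludes grep itself
    sleep 1
done
echo "snapshots written to $LOG"
```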
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Aren't RTLogging and ORLogging in the uvconfig file, not DSParams?
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Only if they moved it. Introduced in 8.1 in the DSParams file.
-craig

"You can never have too many knives" -- Logan Nine Fingers
j.press
Premium Member
Posts: 8
Joined: Tue Apr 17, 2012 3:11 pm

Post by j.press »

chulett wrote: Only if they moved it. Introduced in 8.1 in the DSParams file.
Right. For sanity's sake, I checked both places, and it's defined in neither.
PaulVL
Premium Member
Posts: 1309
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

j.press, to debug where the slowness is, make a copy of the job and start changing it.

Cut over one input database connection at a time to a dataset/text file, and work your way down the list.
Then cut the target database/file over to a Peek stage.
j.press
Premium Member
Posts: 8
Joined: Tue Apr 17, 2012 3:11 pm

Post by j.press »

Well, the issue is that right now we only see this occurring in our production environment, during our nightly batch cycle. It's difficult for me to add anything to these jobs for testing purposes beyond additional logging and such.