Issue with slow job startup times
This hasn't always been the case, but we now have some jobs in our production environment with poor startup times, and only when our batch cycles are running. When there is zero load on the box, the jobs run perfectly fine. I don't know whether it has to do with resources or some other issue like disk I/O or temp space.
Here's an example of one: main_program: Startup time, 30:28; production run time, 0:57.
In this job, for testing purposes, I have set $APT_PM_PLAYER_TIMING = True and $APT_STARTUP_STATUS = True.
The job started at 2:16 AM. In the log, all within 2 seconds, it records all of the database queries it prepared, the job score step, and a few warnings. Then the job hangs for 22 minutes. Only after those 22 minutes does it start logging per the startup environment variables I mentioned, reporting that it's defining section leaders and so on.
At the end of the job, check this out:
02:39:25
main_program: Releasing Section Leaders; parallel step time was 57.307 seconds.
02:47:48
main_program: Returning step status of APT_StatusOk.
When the job is seemingly done, it hangs for another 8 minutes, which apparently also gets factored into the startup time.
While the job was hanging during the 22-minute period I mentioned, I checked what was going on on the box.
There were approximately 130 temp files, 960 &PH& files, 120 datasets, 530 dataset data files, and the box was ranging between 5-10% CPU usage. The log is fairly small, only about 5 days' worth. None of this seems to me like it would cause such a terrible startup time.
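For reference, here's roughly how I pulled those counts together -- just a quick sketch, and the paths are placeholders for our install, so adjust them for your environment:

import os

# Placeholder locations -- substitute your own project, scratch and
# dataset paths; these are NOT standard, just examples for our box.
DIRS = {
    "&PH& files": "/opt/IBM/InformationServer/Server/Projects/MYPROJ/&PH&",
    "temp files": "/tmp",
    "dataset files": "/data/datasets",
}

for label, path in DIRS.items():
    try:
        # Count regular files sitting directly in each directory.
        count = sum(1 for entry in os.scandir(path) if entry.is_file())
        print(f"{label}: {count}")
    except FileNotFoundError:
        print(f"{label}: {path} not found")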
Anyone have any thoughts? We are on DS 8.7 on Linux by the way.
Have you cleared your &PH& directory for the project recently? It contains an entry for every job run in the project, and they never get deleted automatically. After several years you can have hundreds of thousands of entries in the directory, which can cause severe startup slowdowns.
Also - remember any Before Job actions count as "startup".
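If you do want to clear it down, something along these lines would purge anything older than a week -- just a sketch, run it with no jobs active, and the &PH& path here is an assumption, so point it at your actual project directory:

import os
import time

# Assumed location of the project's &PH& directory -- adjust for your install.
PH_DIR = "/opt/IBM/InformationServer/Server/Projects/MYPROJ/&PH&"
MAX_AGE_DAYS = 7

cutoff = time.time() - MAX_AGE_DAYS * 86400
removed = 0
for entry in os.scandir(PH_DIR):
    # Every job run leaves a phantom output file here; purge the stale ones.
    if entry.is_file() and entry.stat().st_mtime < cutoff:
        os.remove(entry.path)
        removed += 1
print(f"Removed {removed} files older than {MAX_AGE_DAYS} days")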
See my post above: we have <1000 &PH& files. If that's still too many, let me know. I believe we delete all files older than 7 days.
Also, not every job in the project is affected. Plenty run with <30 second startup times. But I imagine it could vary depending on what's happening on the server at the time.
We have no before-job or after-job subroutines if that's what you are referring to.
We do not have these parameters set in our DSParams file.
My understanding (correct me if I am wrong) is that these parameters determine whether logging goes to the RT_LOGXXX files or the XMETA repository. I know logging goes to the RT_LOGXXX files, as I can see them grow as the log grows (the largest is 26 MB, but most are <2 MB).
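For the record, this is roughly how I checked -- note that RTLogging and ORLogging are my assumption for the parameter names in question, and the DSParams locations shown are just typical examples for our install, not gospel:

# Scan likely DSParams files for the logging parameters.
PARAM_NAMES = ("RTLogging", "ORLogging")   # assumed names
DSPARAMS_FILES = [
    "/opt/IBM/InformationServer/Server/Projects/MYPROJ/DSParams",  # project copy
    "/opt/IBM/InformationServer/Server/DSEngine/DSParams",         # template copy
]

for path in DSPARAMS_FILES:
    print(f"--- {path}")
    try:
        with open(path) as f:
            for line in f:
                if any(name in line for name in PARAM_NAMES):
                    print(line.rstrip())
    except FileNotFoundError:
        print("not found")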
Basically correct, they control whether the log entries go to the associated repository or not. Technically, both could be set to True and the logs could be going to both locations. I'm surprised that neither is present in your DSParams file; I wonder if the 'default' is true when the environment variable cannot be found? Or have they been moved elsewhere?
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Remember that the entry in the log is NOT the time that the event happened. It is the time that the event was written to the log. There is a difference.
Is your environment GRID enabled? Are you waiting to be released from the queue?
Can you describe what your job does? Is it connecting to a database where the result set is being built during that startup time, so you only see the next log event when row #1 gets returned from the connector stage?
The environment is not on a grid.
This particular job does have a lot going on; it has quite a few DB connections and lookups. But I've seen this with plenty of smaller jobs too, even one that only made a connection to a database and wrote a table to a dataset. In a scenario like that, I wouldn't expect any such activity to occur during the startup window. I would think the startup status timing is logged fairly accurately, but I imagine that's still a guess.
I would love to try to prove out the actual timing of what's going on in the job, but I'm not sure how. Perhaps I can add APT_PM_SHOW_PIDS and have an external script loop around writing out the processes to some sort of log, then compare the timing afterwards.
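Something like this sketch is what I have in mind -- the process names to match (osh players, DSD.RUN phantoms) and the log location are guesses on my part, so treat them as assumptions:

import subprocess
import time
from datetime import datetime

# Process names to watch for -- assumed: osh covers the PX section leaders
# and players, DSD.RUN the DataStage phantoms. Adjust as needed.
WATCH = ("osh", "DSD.RUN")
LOGFILE = "/tmp/pid_watch.log"   # hypothetical output location

with open(LOGFILE, "a") as log:
    while True:   # Ctrl-C to stop
        stamp = datetime.now().strftime("%H:%M:%S")
        # Snapshot the full process list and keep only the lines of interest.
        ps = subprocess.run(["ps", "-ef"], capture_output=True, text=True)
        for line in ps.stdout.splitlines():
            if any(name in line for name in WATCH):
                log.write(f"{stamp} {line}\n")
        log.flush()
        time.sleep(5)   # sample every 5 seconds

Then I could line the timestamps in that log up against the job log afterwards.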
Only if they moved it. Introduced in 8.1 in the DSParams file.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
chulett wrote:
Only if they moved it. Introduced in 8.1 in the DSParams file.
Right. For sanity's sake, I checked both places, and it's defined in neither.