Job overrun... how do it know?

Hello -

I just got a message pertaining to a job overrun. The documentation indicates this means the job ran 'long'... my question is, "how do it know?". I do not see any means whereby I can indicate the length of time to let the job run before deciding it ran long.

Also, in the details of this 'alert' there was a momentary panic when I read that the job did not start - as this is a regularly scheduled job, I was nervous until I went to the server and checked the job history - no failures, so I do not know why the message indicated that the job did not run.

(actual message under Additional Information was "This job step did not run")

randyv February 13, 2008 11:25

Comments

8 comments

Hi,
The job overrun incident is triggered when a job takes a lot longer than its average run time, which we track from previous runs.

A job not starting is not the same as a failure (which would be reported as Job Failure). "Job did not start" means that we found a time when the job should have started, but the job is not currently running and hasn't recently completed (we check its last run date).

Note that if a job did not start because it was still completing a previous run, we try to detect this and report it as "job did not start (overrun)".
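
To make the distinction above concrete, here is a rough Python sketch of that classification. It is only an illustration of the description in this post, not SQL Response's actual code; the function name, its parameters, and the "started" label are all hypothetical.

```python
from datetime import datetime
from typing import Optional


def classify_scheduled_start(
    scheduled: datetime,
    last_run_start: Optional[datetime],
    previous_run_active: bool,
) -> str:
    """Classify one scheduled start time that has already passed (illustrative only)."""
    # A run that began at or after the scheduled time means the job started.
    if last_run_start is not None and last_run_start >= scheduled:
        return "started"
    # No new run, but an earlier run is still going: the schedule was blocked.
    if previous_run_active:
        return "job did not start (overrun)"
    # No new run and nothing in progress.
    return "job did not start"


due = datetime(2008, 2, 13, 11, 0)
on_time = classify_scheduled_start(due, datetime(2008, 2, 13, 11, 0), False)
blocked = classify_scheduled_start(due, datetime(2008, 2, 13, 10, 45), True)
missed = classify_scheduled_start(due, datetime(2008, 2, 13, 10, 45), False)
```

Here `on_time` is "started", `blocked` is "job did not start (overrun)", and `missed` is "job did not start". None of these is a Job Failure, which is a separate incident type.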

If this doesn't help, perhaps you could send me the details of the job schedule and how often it runs, and a sample of the recent job history? You can put that in a pm to me here, or if you like I can pm you my email to send it to.

Thanks,
Nigel

Nigel Morse February 13, 2008 12:20

Hi Nigel,

ALL my jobs are showing as "Job overrun" - it's because they are being seen by SQL Response for the first time ever, and "this" exec is definitely longer than the previous...

The jobs run every 15 minutes and use about 5 seconds of wall time...

The "Did not run" is definitely a red herring! SSMS Job Monitor happily shows each job ran...

PDinCA February 13, 2008 16:29

It won't complain about an overrun on the first run (as it has no baseline). After the first run it notes the time taken, and after each subsequent run it recalculates the average, so if the first run took 1 second and the next took 5 seconds, that would be reported as an overrun. After a few runs it should be more accurate. I think for the final release we'll make it wait a few runs before declaring overruns.
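
As a rough illustration of the running-average behaviour described above, here is a minimal Python sketch. It is not the product's code: the class name, the 3x threshold (`factor`), and the `min_runs` warm-up setting are assumptions made purely for the example.

```python
from dataclasses import dataclass


@dataclass
class OverrunTracker:
    """Tracks a running average of job durations (hypothetical helper)."""

    factor: float = 3.0   # assumed: "a lot longer" means more than 3x the average
    min_runs: int = 1     # assumed: completed runs needed before flagging
    runs: int = 0
    average: float = 0.0

    def record(self, duration_seconds: float) -> bool:
        """Return True if this run counts as an overrun, then update the average."""
        has_baseline = self.runs >= self.min_runs
        overrun = has_baseline and duration_seconds > self.factor * self.average
        # Incremental mean: fold the new duration into the average.
        self.runs += 1
        self.average += (duration_seconds - self.average) / self.runs
        return overrun


tracker = OverrunTracker()
first = tracker.record(1.0)    # no baseline yet, so never flagged
second = tracker.record(5.0)   # 5 s against a 1 s average: flagged

# Waiting for a few runs before flagging, as suggested for the final release.
patient = OverrunTracker(min_runs=3)
patient_flags = [patient.record(d) for d in (1.0, 5.0, 5.0, 5.0)]
```

With the default settings the 5-second run is flagged straight after a single 1-second run, which matches the early false alarms reported in this thread. With `min_runs=3` the same sequence raises no overrun, because the average has settled by the time flagging begins.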

For the did not run issue - could you send me a screen shot of the incidents, and a sample of the job history from sp_help_jobhistory (preferably covering the same time as reported in the incident)? I'll pm you my email.

Nigel Morse February 13, 2008 16:40

PDinCA wrote:

The "Did not run" is definitely a red herring! SSMS Job Monitor happily shows each job ran...

I've realised what you meant here now. The "Did not run" was referring to the job step in the panel below, as part of the job overrun incident. As you said, this is a red herring. We put all the job information into the tab so that in the case of a failure you can see which steps ran and which didn't. However, in a job overrun we don't say which steps have run, so the UI displays each step as "did not run" when it should say something like "for information".

We're going to try hard to reproduce an erroneous "job did not start" incident today.

Nigel Morse February 14, 2008 05:24

I'm also getting "red herrings" of jobs/steps not running. In the job steps tab, in the "Additional Information" area it says "The job step did not run."

I'm currently monitoring only 3 servers. They have 1000+ open incidents over the past couple of days. Most seem to be red herrings...

philcruz February 15, 2008 10:07

The "job step did not run" in the additional information is just the way the UI is displaying it. We always show all the job steps, but which ones ran or didn't is only useful for a job failure incident; in all other incidents it's there for information.

As for the 1000+ incidents, I suspect this is due to the timezone bug we found, which generates false incidents for jobs in the future. We've fixed that here for the next release.

If you think you're getting other erroneous incidents then please do let us know as we'd really like to fix that!

Thanks,
Nigel

Nigel Morse February 18, 2008 04:57

Sent further snapshots for a job scheduled at 2-minute intervals, with durations of up to 6 seconds only, which shows as a "Job overrun" incident.

PDinCA February 26, 2008 14:39

Those incidents are different from the "Job did not start (overrunning)" ones. A job overrun is where a run of a job takes significantly longer than its average run, so if a job usually completes within a second and then takes 6, we tell you. I think we're planning to make it wait a number of seconds before it starts complaining, as the timings for jobs that are only a few seconds long aren't that accurate.

Nigel Morse February 27, 2008 05:22
