Comments
8 comments
-
Hi,
The job overrun is triggered when the job takes a lot longer than the average time which we track from previous runs.
A job not starting is not the same as a failure (which would be reported as Job Failure). "Job did not start" means that we found a time when the job should have started, but the job is not currently running and hasn't recently completed (we check its last run date).
Note that if a job does not start because it was still completing a previous run, we do try to detect this and report it as "job did not start (overrun)".
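As a rough illustration only, the check described above could be sketched like this in Python. This is not SQL Response's actual code: the function name, its parameters and the one-minute grace period are all assumptions made up for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify_missed_start(
    scheduled_start: datetime,
    now: datetime,
    is_running: bool,
    last_run_start: Optional[datetime],
    last_run_end: Optional[datetime],
    grace: timedelta = timedelta(minutes=1),  # hypothetical tolerance
) -> Optional[str]:
    """Return an incident label, or None if nothing looks wrong.

    Hypothetical sketch: all names and the grace period are assumptions.
    """
    # Too early to judge: the scheduled time has only just passed.
    if now < scheduled_start + grace:
        return None

    # A run started at or after the scheduled time, so the job did start
    # (it is either still running or has recently completed).
    if last_run_start is not None and last_run_start >= scheduled_start:
        return None

    # No run started for this schedule. If the previous run was still
    # going at the scheduled time, blame the overrun.
    previous_still_running = is_running or (
        last_run_end is not None and last_run_end > scheduled_start
    )
    if previous_still_running:
        return "job did not start (overrun)"

    return "job did not start"
```

For a job scheduled at 12:00 whose last run began at 11:45 and finished five seconds later, calling this at 12:05 would return "job did not start"; if that 11:45 run were still executing, it would return "job did not start (overrun)".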
If this doesn't help, perhaps you could send me the details of the job schedule and how often it runs, and a sample of the recent job history? You can put that in a pm to me here, or if you like I can pm you my email to send it to.
Thanks,
Nigel -
Hi Nigel,
ALL my jobs are showing as "Job overrun" - it's because they are being seen by SQL Response for the first time ever, and "this" exec is definitely longer than the previous...
The jobs run every 15 minutes and use about 5 seconds of wall time...
The "Did not run" is definitely a red herring! SSMS Job Monitor happily shows each job ran... -
It won't complain about an overrun the first time (as it has no baseline). After the first run it notes the time taken, and with each subsequent run it recalculates the average, so if the first run was 1 second and the next 5 seconds then that would be reported as an overrun. After a few runs it should be more accurate. I think for the final release we'll make it wait a few runs before declaring overruns.
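To make that behaviour concrete, here is a minimal Python sketch of a running-average overrun check. It is an illustration under assumptions, not the product's implementation: the class name, the factor of 3 used to stand in for "a lot longer", and the warmup_runs setting are all hypothetical.

```python
class OverrunTracker:
    """Hypothetical running-average overrun detector for one job."""

    def __init__(self, factor: float = 3.0, warmup_runs: int = 0):
        self.factor = factor            # assumed meaning of "a lot longer"
        self.warmup_runs = warmup_runs  # runs to observe before alerting
        self.count = 0                  # number of runs recorded so far
        self.mean = 0.0                 # average duration of those runs

    def record(self, duration_seconds: float) -> bool:
        """Record one run and return True if it counts as an overrun."""
        # Compare against the average of the PREVIOUS runs only; with no
        # previous runs there is no baseline, so never flag the first run.
        is_overrun = (
            self.count > 0
            and self.count >= self.warmup_runs
            and duration_seconds > self.factor * self.mean
        )
        # Update the running average incrementally.
        self.count += 1
        self.mean += (duration_seconds - self.mean) / self.count
        return is_overrun
```

With the defaults, a 1 second run followed by a 5 second run flags the second run, which matches the early false alarms described in this thread. Setting warmup_runs to a few runs is one way to get the "wait a few runs" behaviour mentioned above.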
For the did not run issue - could you send me a screen shot of the incidents, and a sample of the job history from sp_help_jobhistory (preferably covering the same time as reported in the incident)? I'll pm you my email. -
PDinCA wrote:The "Did not run" is definitely a red herring! SSMS Job Monitor happily shows each job ran...
I've realised what you meant here now. The "Did not run" was referring to the job step in the panel below as part of the job overrun incident. As you said, this is a red herring. We put all the job information into the tab so that in the case of a failure you can see which steps ran and which didn't. However, in a job overrun we don't say which steps have run, and so the UI is displaying it as "did not run" whereas it should say something like "for information".
We're going to try hard to see if we can reproduce an erroneous "job did not start" incident today. -
I'm also getting "red herrings" of jobs/steps not running. In the job steps tab, in the "Additional Information" area it says "The job step did not run."
I'm currently monitoring only 3 servers. They have 1000+ open incidents over the past couple of days. Most seem to be red herrings... -
The "job step did not run" in the additional information is just the way the UI is displaying it, i.e. we always show all the job steps, but which ones ran or didn't will only be useful for a job failure incident. In all other incidents it's there for information.
As for the 1000+ incidents, I suspect this is due to the timezone bug that we found, which generates false incidents for jobs in the future. We've fixed that for the next release.
If you think you're getting other erroneous incidents then please do let us know as we'd really like to fix that!
Thanks,
Nigel -
Sent further snapshots for job scheduled at 2 minute intervals, with durations of up to 6 seconds only - shows as "Job overrun" Incident.
-
Those incidents are different from the "Job did not start (overrunning)" ones. A job overrun is where a run of a job takes significantly longer than its average run, so if a job usually completes within a second and then takes 6 we tell you. I think we're planning to make it wait a number of seconds before it starts complaining, as jobs that are a few seconds long aren't that accurate with the timings.
I just got a message pertaining to a job overrun. The documentation indicates this means the job ran 'long'... my question is, "how do it know?". I do not see any means whereby I can indicate the length of time to let the job run before deciding it ran long.
Also, in the details of this 'alert' there was a momentary panic when I read that the job did not start - as this is a regularly scheduled job, I was nervous until I went to the server and checked the job history - no failures, so I do not know why the message indicated that the job did not run.
(actual message under Additional Information was "This job step did not run")