Bugzilla – Bug 1550
Fixes for race condition in job manager
Last modified: 2008-08-15 04:44:11
This is a patch by David Smith from CERN to fix a race condition in the job
manager. It has been tested as part of the VDT.
I will attach it separately.
Created an attachment (id=314)
Patch for race condition
What is the reason for synthesizing commit messages when the last callback is
unregistered? Shouldn't the job manager just go into the STOP state and save
its state in this case?
There are two parts to this patch.
The change in globus_l_gram_job_manager_query_valid() relates to the decision on
whether the job manager should process incoming queries. As originally written,
if the job manager is in the process of restarting then the 'restart' state is
checked; the restart state is used to compute the state to jump to after the
initial two-phase commit on a job manager restart. Firstly, it seems a little
inconsistent to check that state directly. But apart from that, the final state
does not seem to be relevant for the query. The problem we saw because of this was:
We saw that a stdio update 'query' was allowed while a restarted job manager was
being shut down (this sequence of events was exceptional, but the job manager
should behave reasonably). The job manager then hung when GRAM was trying to
shut down: the query reply was never sent and so there was an outstanding
connection. In short, I thought it was more appropriate to test the validity of
the query against the current job manager state.
The second part of this patch relates to the final commit wait when there are no
longer any GRAM callbacks registered. (This goes together with bug #1551.) The
problem we saw was that if a job was submitted with two-phase requested, the
callbacks were unregistered, and the job was left alone, it would hang for the
duration of the two-phase commit when the job finished. E.g. if you submit a job
with globusrun in batch mode and request two_phase in the RSL you see this behaviour.
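For reference, a batch submission of the kind described might look like the fragment below. The exact RSL attribute name and value for requesting two-phase commit vary across toolkit versions, so treat this as illustrative only:

```
globusrun -b -r somehost.example.org/jobmanager \
    "&(executable=/bin/date)(two_phase=60)"
```

With -b (batch mode) globusrun exits after submission without registering for callbacks, which is exactly the situation where the final commit wait has no one to satisfy it.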
It seemed reasonable to assume that nobody will send the final commit signal if
there are no callbacks registered, so in this case I wanted to avoid the wait.
The patches here move the state to the committed state if the last callback is
unregistered during the wait or, for bug #1551, if they are unregistered before.
It may be that you think it is more appropriate to move to STOP, i.e. consider
the commit failed. Please feel free to implement this as you see fit.
I don't agree with the second part of this patch. I don't think you can assume that the job
should be cleaned up just because there are no registered callbacks. Maybe a better
solution would be to have different timeout values for the different waiting phases of the
two-phase commit? Then you could submit a job with the waiting phase for job submission
(COMMIT_REQUEST) set to 20 minutes and the waiting phase for job cleanup
(COMMIT_END) set to 2 seconds.
If we say that the two-phase commit semantics require the job manager to wait,
then I agree with you, since these patches break that. I suppose you could
imagine a case where there is no callback registered but there are periodic
queries to check the state (where the period is less than the two-phase commit
time), after which the final commit is supplied. Some other, non-GRAM, protocol
might even conceivably be used to check the state.
However, if we only define the second commit wait when there is a client
registered for callbacks, then we could potentially save JMs hanging around if
someone submits unattended jobs with two-phase enabled (because they wanted the
initial commit feature). In this case perhaps Joe's initial suggestion, to move
to STOP rather than DONE, is a good idea.
Anyway, since there is some discussion about this, I propose that you ignore
these patches for now. (That is, the second part of this patch for bug 1550, and patch 1551.)
As for your suggestion, I feel that it is not worthwhile for you to do it. We
can simply leave it that if a two-phase commit is specified the JM will wait. In
our case the problem was noticed as part of tests designed to check the impact
on the system when using the standard Globus tools to manage jobs. By default we
use Condor-G to manage job submission, in which case this situation shouldn't
arise, although in exceptional circumstances it might be possible.
It could be worthwhile changing globusrun to issue a warning when you submit a
batch job in this way. Or at least add documentation to the tools noting that
'batch' jobs with a significant two-phase commit period may not be desirable
because of the wait at the end of the job.
The patches for these are committed to 4.2 branch and trunk.