| Summary: | Fixes for race condition in job manager | ||
|---|---|---|---|
| Product: | GRAM | Reporter: | Alain Roy <roy@cs.wisc.edu> |
| Component: | gt2 Gatekeeper/Jobmanager | Assignee: | Joe Bester <bester@mcs.anl.gov> |
| Status: | RESOLVED FIXED | ||
| Severity: | major | CC: | bester@mcs.anl.gov, David.Smith@cern.ch, jfrey@cs.wisc.edu, litmaath@cern.ch, parag@cs.wisc.edu, smartin@mcs.anl.gov |
| Priority: | P2 | Keywords: | VDT |
| Version: | 1.6 | ||
| Target Milestone: | 4.2.1 | ||
| Hardware: | PC | ||
| OS: | All | ||
| Bug Depends on: | |||
| Bug Blocks: | 6192 | ||
| Attachments: | Patch for race condition | ||
Created an attachment (id=314) [details]
Patch for race condition
What is the reason for synthesizing commit messages when the last callback is unregistered? Shouldn't the job manager just go into the STOP state and save its state in this case? joe
Hello Joe,

There are two parts to this patch.

The change in globus_l_gram_job_manager_query_valid() relates to the decision on whether the job manager should process incoming queries. As originally written, if the job manager is in the process of restarting then the 'restart' state is checked - the restart state is used to compute the state to jump to after the initial two phase commit on a job manager restart. So firstly it seems a little inconsistent to check the state directly. But apart from that, the final state does not seem to be relevant for the query. The problem we saw because of this was that a stdio update 'query' was allowed while a restarted job manager was being shut down (this sequence of events was exceptional, but the job manager should behave reasonably). The job manager then hung when GRAM was trying to shut down: the query reply was never sent and so there was an outstanding connection. In short, I thought that it was more appropriate to test for the validity of the query according to the current job manager state.

The second part of this patch relates to the final commit wait when there are no longer any GRAM callbacks registered. (This goes together with bug #1551.) The problem we saw was that if a job was submitted with two phase requested, callbacks unregistered and the job left, it would hang for the duration of the two phase commit when the job finished. E.g. if you submit a job with globusrun in batch mode and request two_phase in the RSL, you find this behaviour. It seemed reasonable to assume that nobody will send the final commit signal if there are no callbacks registered, so in this case I wanted to avoid the wait. The patches here move the state to the committed state if the last callback is unregistered during the wait or, for bug #1551, if they are unregistered before. It may be that you think it is more appropriate to move to STOP, i.e. consider the commit failed. Please feel free to implement this as you see fit.
Thanks, David
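To make the first part of the patch concrete, here is a minimal Python model of the two checks. It is only a sketch and not the job manager's C code: the state names, the `JobManager` class, the `NO_QUERY_STATES` set, and both function names are hypothetical and chosen for illustration. It shows how a check that consults the restart state can accept a query that the current state should reject.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class State(Enum):
    # Hypothetical, simplified states; the real job manager has many more.
    TWO_PHASE = auto()   # waiting for the initial commit
    RUNNING = auto()
    STOPPING = auto()    # shutting down, no further replies will be sent
    DONE = auto()


# Assumption for this sketch: queries cannot be answered in these states.
NO_QUERY_STATES = {State.STOPPING, State.DONE}


@dataclass
class JobManager:
    state: State                            # what the job manager is doing now
    restart_state: Optional[State] = None   # state a restart will jump to later


def query_valid_before(jm: JobManager) -> bool:
    """Model of the old check: a restarting job manager consults restart_state."""
    checked = jm.restart_state if jm.restart_state is not None else jm.state
    return checked not in NO_QUERY_STATES


def query_valid_after(jm: JobManager) -> bool:
    """Model of the patched check: only the current state matters."""
    return jm.state not in NO_QUERY_STATES


# A restarted job manager that is already shutting down.
jm = JobManager(state=State.STOPPING, restart_state=State.RUNNING)
print(query_valid_before(jm))  # True: the query is accepted and never answered
print(query_valid_after(jm))   # False: the query is rejected up front
```

In this model the old check lets the stdio update query through during shutdown, which corresponds to the outstanding connection described above.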
David, I don't agree on the second part of this patch. I don't think you can assume that the job should be cleaned up just because there are no registered callbacks. Maybe a better solution would be to have different timeout values for the different waiting phases of the two_phase commit? So you could submit a job with the waiting phase of the job submission (COMMIT_REQUEST) set to 20 minutes and the waiting phase of the job cleanup (COMMIT_END) set to 2 seconds.
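The per-phase timeout idea could look roughly like the following Python sketch. This is an illustration of the suggestion only: GRAM is not stated here to support separate timeouts, and the `TwoPhaseTimeouts` class, its field names, and `wait_seconds()` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoPhaseTimeouts:
    # Hypothetical settings: one timeout per waiting phase, in seconds.
    commit_request_s: float   # wait for the initial commit after submission
    commit_end_s: float       # wait for the final commit after the job finishes


def wait_seconds(timeouts: TwoPhaseTimeouts, phase: str) -> float:
    """Return how long to wait for a commit signal in the given phase."""
    if phase == "COMMIT_REQUEST":
        return timeouts.commit_request_s
    if phase == "COMMIT_END":
        return timeouts.commit_end_s
    raise ValueError(f"unknown two-phase commit phase: {phase}")


# The values from the comment above: 20 minutes at submission, 2 seconds at cleanup.
timeouts = TwoPhaseTimeouts(commit_request_s=20 * 60, commit_end_s=2)
print(wait_seconds(timeouts, "COMMIT_REQUEST"))  # 1200
print(wait_seconds(timeouts, "COMMIT_END"))      # 2
```

With settings like these, an unattended job would keep the long initial commit window but hold the job manager for only a couple of seconds at the end.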
Hi Stuart,

If we say that the two phase commit semantics requires the job manager to wait then I agree with you - since these patches break that. I suppose you could imagine a case where there is no callback registered but there are periodic queries to check the state (where the period is less than the two phase commit time) after which the final commit is supplied. Some other, non-GRAM, protocol might even conceivably be used to check the state.

However, if we only define the second commit wait when there is a client registered for callbacks then we could potentially save JMs hanging around if someone submits unattended jobs with two-phase enabled (because they wanted the initial commit feature). In this case perhaps Joe's initial suggestion, to move to STOP rather than DONE, is a good idea.

Anyway, since there is some discussion about this, I propose that you ignore these patches for now. (That is the second part of this patch, 1550, and patch 1551.) As for your suggestion, I feel that it is not worthwhile for you to do it. We can simply leave it that if a two-phase commit is specified the JM will wait.

In our case the problem was noticed as part of tests designed to check the impact on the system when using the standard Globus tools to manage jobs. By default we use Condor-G to manage job submission, in which case this situation shouldn't arise - although in exceptional circumstances it might be possible. It could be worthwhile changing globusrun to issue a warning in case you submit a batch job in this way. Or at least add documentation to the tools that 'batch' jobs with a significant two-phase commit period may not be desirable because of the wait at the end of the job.

Many thanks, David
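The three options discussed in this thread for the end-of-job commit wait can be compared side by side in a small Python model. This is a sketch under stated assumptions, not the job manager's logic: the policy names, the `after_job_finishes()` function, and the returned labels are hypothetical.

```python
from enum import Enum


class FinalCommitPolicy(Enum):
    # Hypothetical names for the options raised in the discussion.
    ALWAYS_WAIT = "always_wait"              # two-phase semantics: always wait
    DONE_IF_UNWATCHED = "done_if_unwatched"  # original patch: treat as committed
    STOP_IF_UNWATCHED = "stop_if_unwatched"  # Joe's suggestion: go to STOP


def after_job_finishes(two_phase: bool, callbacks: int,
                       policy: FinalCommitPolicy) -> str:
    """Return what the modelled job manager does once the job has finished."""
    if not two_phase:
        return "DONE"                 # no final commit was requested at all
    if callbacks > 0 or policy is FinalCommitPolicy.ALWAYS_WAIT:
        return "WAIT_FOR_COMMIT_END"  # someone may still send the commit
    if policy is FinalCommitPolicy.DONE_IF_UNWATCHED:
        return "DONE"
    return "STOP"                     # save state and exit, commit not assumed


# An unattended batch job with two_phase requested and no callbacks left.
for policy in FinalCommitPolicy:
    print(policy.value, "->", after_job_finishes(True, 0, policy))
```

Only the first policy keeps the job manager around for an unattended batch job, which is the hang described above; the other two differ in whether the missing commit counts as success or failure.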
The patches for these are committed to 4.2 branch and trunk.