<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "http://bugzilla.globus.org/bugzilla/bugzilla.dtd">

<bugzilla version="3.2.3"
          urlbase="http://bugzilla.globus.org/bugzilla/"
          maintainer="bacon@mcs.anl.gov"
>

    <bug>
          <bug_id>1550</bug_id>
          
          <creation_ts>2004-02-13 14:24</creation_ts>
          <short_desc>Fixes for race condition in job manager</short_desc>
          <delta_ts>2008-08-15 04:44:11</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>GRAM</product>
          <component>gt2 Gatekeeper/Jobmanager</component>
          <version>1.6</version>
          <rep_platform>PC</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          
          <keywords>VDT</keywords>
          <priority>P2</priority>
          <bug_severity>major</bug_severity>
          <target_milestone>4.2.1</target_milestone>
          
          <blocked>6192</blocked>
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Alain Roy">roy@cs.wisc.edu</reporter>
          <assigned_to name="Joe Bester">bester@mcs.anl.gov</assigned_to>
          <cc>bester@mcs.anl.gov</cc>
          <cc>David.Smith@cern.ch</cc>
          <cc>jfrey@cs.wisc.edu</cc>
          <cc>litmaath@cern.ch</cc>
          <cc>parag@cs.wisc.edu</cc>
          <cc>smartin@mcs.anl.gov</cc>

          <long_desc isprivate="0">
            <who name="Alain Roy">roy@cs.wisc.edu</who>
            <bug_when>2004-02-13 14:24:05</bug_when>
            <thetext>This is a patch by David Smith from CERN to fix a race condition in the job 
manager. It has been tested as part of the VDT.

I will attach it separately.

-alain</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Alain Roy">roy@cs.wisc.edu</who>
            <bug_when>2004-02-13 14:28:42</bug_when>
            <thetext>Created an attachment (id=314)
Patch for race condition</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Joe Bester">bester@mcs.anl.gov</who>
            <bug_when>2004-03-04 10:31:05</bug_when>
            <thetext>What is the reason for synthesizing commit messages when the last callback is 
unregistered? Shouldn&apos;t the job manager just go into the STOP state and save 
its state in this case? 
 
joe </thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="David Smith">David.Smith@cern.ch</who>
            <bug_when>2004-03-09 12:11:01</bug_when>
            <thetext>Hello Joe,

There are two parts to this patch.

The change in globus_l_gram_job_manager_query_valid() relates to the decision on
whether the jobmanager should process incoming queries. As originally written, if
the job manager is in the process of restarting then the &apos;restart&apos; state is
checked - the restart state is used to compute the state to jump to after the
initial two phase commit on a jobmanager restart. So firstly it seems a little
inconsistent to check the state directly. But apart from that the final state
does not seem to be relevant for the query - the problem we saw because of this was:

We saw that a stdio update &apos;query&apos; was allowed while a restarted job manager was
being shut down (this sequence of events was exceptional, but the job manager
should behave reasonably). The job manager then hung when GRAM was trying to
shut down: the query reply was never sent, so there was an outstanding
connection. In short, I thought that it was more appropriate to test for the
validity of the query according to the current job manager state.

The second part of this patch relates to the final commit wait when there are no
longer any GRAM callbacks registered. (This goes together with bug #1551.) The
problem we saw was that if a job was submitted with two phase requested,
callbacks unregistered and the job left, it would hang for the duration of the
two phase commit when the job finished. E.g. if you submit a job with globusrun
in batch mode and request two_phase in the RSL you find this behaviour.

It seemed reasonable to assume that nobody will send the final
commit signal if there are no callbacks registered, so in this case I wanted to
avoid the wait. The patches here move the state to the committed state if the
last callback is unregistered during the wait or, for bug #1551, if they are
unregistered before. It may be that you think it is more appropriate to move to
STOP, i.e. consider the commit failed. Please feel free to implement this as you
see fit.

Thanks,
David
</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Stuart Martin">smartin@mcs.anl.gov</who>
            <bug_when>2004-03-15 11:30:45</bug_when>
            <thetext>David,

I don&apos;t agree on the second part of this patch.  I don&apos;t think you can assume that the job 
should be cleaned up just because there are no registered callbacks.  Maybe a better 
solution would be to have different timeout values for the different waiting phases of the 
two_phase commit?  So you could submit a job with the waiting phase of the job submission 
(COMMIT_REQUEST) set to 20 minutes and the waiting phase of the job cleanup 
(COMMIT_END) set to 2 seconds.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="David Smith">David.Smith@cern.ch</who>
            <bug_when>2004-03-17 04:25:41</bug_when>
            <thetext>Hi Stuart,

If we say that the two phase commit semantics require the job manager to wait
then I agree with you - since these patches break that. I suppose you could
imagine a case where there is no callback registered but there are periodic
queries to check the state (where the period is less than the two phase commit
time) after which the final commit is supplied. Some other, non-GRAM, protocol
might even conceivably be used to check the state.

However, if we only define the second commit wait when there is a client
registered for callbacks then we could potentially save JMs hanging around if
someone submits unattended jobs with two-phase enabled (because they wanted the
initial commit feature). In this case perhaps Joe&apos;s initial suggestion, to move
to STOP rather than DONE, is a good idea.

Anyway, since there is some discussion about this, I propose that you ignore
these patches for now. (That is, the second part of this patch, 1550, and patch 1551.)

As for your suggestion, I feel that it is not worthwhile for you to do it. We
can simply leave it that if a two-phase commit is specified the JM will wait. In
our case the problem was noticed as part of tests designed to check the impact
on the system when using the standard globus tools to manage jobs. By default we
use Condor-G to manage job submission, in which case this situation shouldn&apos;t
arise - although in exceptional circumstances it might be possible.

It could be worthwhile changing globusrun to issue a warning in case you submit a
batch job in this way. Or at least add documentation to the tools that &apos;batch&apos;
jobs with a significant two-phase commit period may not be desirable because of the
wait at the end of the job.

Many thanks,
David
</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Joe Bester">bester@mcs.anl.gov</who>
            <bug_when>2008-08-15 04:44:11</bug_when>
            <thetext>The patches for these are committed to 4.2 branch and trunk.</thetext>
          </long_desc>
      
          <attachment
              isobsolete="0"
              ispatch="1"
              isprivate="0"
          >
            <attachid>314</attachid>
            <date>2004-02-13 14:28</date>
            <desc>Patch for race condition</desc>
            <filename>36__gram_job_manager_query.diff</filename>
            <type>text/plain</type>
            <size>2008</size>
            <attacher>roy@cs.wisc.edu</attacher>
            <data encoding="base64">LS0tIGJhZC9ncmFtL2pvYm1hbmFnZXIvc291cmNlL2dsb2J1c19ncmFtX2pvYl9tYW5hZ2VyX3F1
ZXJ5LmMJTW9uIEZlYiAgMiAxNjozMjoxMyAyMDA0CisrKyBmaXgvZ3JhbS9qb2JtYW5hZ2VyL3Nv
dXJjZS9nbG9idXNfZ3JhbV9qb2JfbWFuYWdlcl9xdWVyeS5jCU1vbiBGZWIgIDIgMTY6Mzc6NTcg
MjAwNApAQCAtNDQsNyArNDQsOCBAQAogaW50CiBnbG9idXNfbF9ncmFtX2pvYl9tYW5hZ2VyX3Vu
cmVnaXN0ZXIoCiAgICAgZ2xvYnVzX2dyYW1fam9ibWFuYWdlcl9yZXF1ZXN0X3QgKglyZXF1ZXN0
LAotICAgIGNvbnN0IGNoYXIgKgkJCXVybCk7CisgICAgY29uc3QgY2hhciAqCQkJdXJsLAorICAg
IGdsb2J1c19ncmFtX3Byb3RvY29sX2hhbmRsZV90CWhhbmRsZSk7CiAKIHN0YXRpYwogaW50CkBA
IC0xNzAsNyArMTcxLDcgQEAKICAgICB9CiAgICAgZWxzZSBpZiAoc3RyY21wKHF1ZXJ5LCJ1bnJl
Z2lzdGVyIik9PTApCiAgICAgewotCXJjID0gZ2xvYnVzX2xfZ3JhbV9qb2JfbWFuYWdlcl91bnJl
Z2lzdGVyKHJlcXVlc3QsIHJlc3QpOworCXJjID0gZ2xvYnVzX2xfZ3JhbV9qb2JfbWFuYWdlcl91
bnJlZ2lzdGVyKHJlcXVlc3QsIHJlc3QsIGhhbmRsZSk7CiAgICAgfQogICAgIGVsc2UgaWYgKHN0
cmNtcChxdWVyeSwicmVuZXciKT09MCkKICAgICB7CkBAIC00NDYsNyArNDQ3LDkgQEAKIGludAog
Z2xvYnVzX2xfZ3JhbV9qb2JfbWFuYWdlcl91bnJlZ2lzdGVyKAogICAgIGdsb2J1c19ncmFtX2pv
Ym1hbmFnZXJfcmVxdWVzdF90ICoJcmVxdWVzdCwKLSAgICBjb25zdCBjaGFyICoJCQl1cmwpCisg
ICAgY29uc3QgY2hhciAqCQkJdXJsLAorICAgIGdsb2J1c19ncmFtX3Byb3RvY29sX2hhbmRsZV90
CWhhbmRsZSkKKwogewogICAgIGludCByYzsKIApAQCAtNDYxLDYgKzQ2NCwyMSBAQAogICAgIGVs
c2UKICAgICB7CiAJcmMgPSBnbG9idXNfZ3JhbV9qb2JfbWFuYWdlcl9jb250YWN0X3JlbW92ZShy
ZXF1ZXN0LCB1cmwpOworCisgICAgICAgIC8qIEluY2FzZSB3ZSB1bnJlZ2lzdGVyIHRoZSBsYXN0
IGNhbGxiYWNrIGFuZCB3ZSdyZSB3YWl0aW5nCisgICAgICAgICAqIGZvciBUV09fUEhBU0VfRU5E
IGNvbW1pdCwgZmFrZSB0aGUgQ09NTUlUX0VORCBzaWduYWwKKyAgICAgICAgICovCisKKyAgICAg
ICAgaWYgKCFyZXF1ZXN0LT5jbGllbnRfY29udGFjdHMgJiYKKyAgICAgICAgICAgICByZXF1ZXN0
LT5qb2JtYW5hZ2VyX3N0YXRlID09IEdMT0JVU19HUkFNX0pPQl9NQU5BR0VSX1NUQVRFX1RXT19Q
SEFTRV9FTkQpCisgICAgICAgIHsKKyAgICAgICAgICAgIGdsb2J1c19ib29sX3QgcmVwbHk9R0xP
QlVTX1RSVUU7CisgICAgICAgICAgICBjaGFyIGJ1ZlszMl07CisKKyAgICAgICAgICAgIHNucHJp
bnRmKGJ1ZixzaXplb2YoYnVmKSwiJWQiLEdMT0JVU19HUkFNX1BST1RPQ09MX0pPQl9TSUdOQUxf
Q09NTUlUX0VORCk7CisgICAgICAgICAgICBnbG9idXNfbF9ncmFtX2pvYl9tYW5hZ2VyX3NpZ25h
bChyZXF1ZXN0LGJ1ZixoYW5kbGUsJnJlcGx5KTsKKyAgICAgICAgICAgIGdsb2J1c19hc3NlcnQo
cmVwbHkgPT0gR0xPQlVTX1RSVUUpOworICAgICAgICB9CiAgICAgfQogICAgIHJldHVybiByYzsK
IH0KQEAgLTkxOCw5ICs5MzYsNyBAQAogZ2xvYnVzX2xfZ3JhbV9qb2JfbWFuYWdlcl9xdWVyeV92
YWxpZCgKICAgICBnbG9idXNfZ3JhbV9qb2JtYW5hZ2VyX3JlcXVlc3RfdCAqCXJlcXVlc3QpCiB7
Ci0gICAgc3dpdGNoKAotCSAgICAocmVxdWVzdC0+cmVzdGFydF9zdGF0ZSAhPSBHTE9CVVNfR1JB
TV9KT0JfTUFOQUdFUl9TVEFURV9TVEFSVCkKLQkgICAgPyByZXF1ZXN0LT5yZXN0YXJ0X3N0YXRl
IDogcmVxdWVzdC0+am9ibWFuYWdlcl9zdGF0ZSkKKyAgICBzd2l0Y2gocmVxdWVzdC0+am9ibWFu
YWdlcl9zdGF0ZSkKICAgICB7CiAgICAgICBjYXNlIEdMT0JVU19HUkFNX0pPQl9NQU5BR0VSX1NU
QVRFX1NUQVJUOgogICAgICAgY2FzZSBHTE9CVVNfR1JBTV9KT0JfTUFOQUdFUl9TVEFURV9NQUtF
X1NDUkFUQ0hESVI6Cg==
</data>        

          </attachment>
      

    </bug>

</bugzilla>