Bugzilla – Bug 5617
GRAM4 seg hangs with fork jobs
Last modified: 2008-07-18 14:11:38
SEG hangs when the log file, globus-fork.log, is on NFS. I traced this down to the "fork_starter.c" code. Specifically, the log file's permissions are set to 622. The open/write C code in "fork_starter.c" has a do-while loop around the "fcntl" call (see below). If the file is on an NFS mount, rc comes back as -1 and errno as 11 (EAGAIN) because it cannot get a write lock over NFS. (Note: I am running rpc.rstatd on the system.) This becomes an infinite loop, as all subsequent calls to fcntl return the same result.

If I change the file permissions to 666, it proceeds normally. Alternatively, if I skip the write lock by changing the code (in my simple test program) to set rc = 0 in the EAGAIN case, it works fine.

Below is the loop.

-Jeff

    do
    {
        rc = fcntl(logfd, F_SETLKW, &lock);
        if (rc < 0)
        {
            switch (errno)
            {
                case EACCES:
                case EAGAIN:
                    rc = 1;
                    break;
                case EBADF:
                    globus_assert(errno != EBADF);
                    break;
                case EDEADLK:
                    globus_assert(errno != EDEADLK);
                    break;
                case EFAULT:
                    globus_assert(errno != EFAULT);
                    break;
                case EINTR:
                    rc = 1;
                    break;
            }
        }
    } while (rc == 1);
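For illustration, here is a minimal sketch of how the loop could be bounded so that a persistent EAGAIN surfaces as an error instead of spinning forever. This is not the code in fork_starter.c and not the eventual fix; the function name lock_log_bounded and the max_attempts parameter are hypothetical, and a real version would probably want a short delay between retries.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of fork_starter.c): try to take a write
 * lock on the whole file, giving up after max_attempts tries.
 * Returns 0 on success, or -1 with errno set to the last failure.
 */
int lock_log_bounded(int logfd, int max_attempts)
{
    struct flock lock = {0};
    int saved_errno = EAGAIN;
    int attempt;

    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;             /* 0 means "through end of file" */

    for (attempt = 0; attempt < max_attempts; attempt++)
    {
        if (fcntl(logfd, F_SETLKW, &lock) == 0)
        {
            return 0;           /* lock acquired */
        }
        saved_errno = errno;
        if (saved_errno != EINTR &&
            saved_errno != EAGAIN &&
            saved_errno != EACCES)
        {
            break;              /* e.g. EBADF, EDEADLK: retrying will not help */
        }
    }

    /* Retries exhausted or a hard error: report it to the caller. */
    errno = saved_errno;
    return -1;
}
```

A caller would treat a -1 return as "could not lock the log" and report the errno rather than starting the job, e.g. `if (lock_log_bounded(logfd, 10) != 0) { perror("lock"); }`.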
*** Bug 5620 has been marked as a duplicate of this bug. ***
Thanks for the report Jeff. Joe is off til the end of the month, but we should be able to make this change for 4.0.6. -Stu
Jeff, Joe and I discussed this some. Seems this needs further investigation. I'm removing the 4.0.6 milestone. -Stu
I've put a new version of the globus fork starter at http://www-unix.mcs.anl.gov/~bester/patches/globus_fork_starter-0.4.tar.gz which should better detect errors in this situation before the job is started and report them. I don't have access to a system that doesn't have working fcntl locks, so I can't verify that this catches errors properly. If this detects the problem for you, we can probably call that program in the setup package to check that the logging file will work in practice.
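As a rough idea of what such a setup-time check might look like (this is a sketch, not the contents of the patched fork starter): open the log file and attempt a non-blocking write lock with F_SETLK, so a failure is reported immediately instead of hanging. The function name probe_log_lock and the open flags are assumptions. Note that a lock held by another process also shows up as EAGAIN or EACCES, so the check only makes sense while nothing else is writing to the log.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical pre-flight check: returns 0 if a write lock can be taken
 * and released on the file at path, otherwise the errno of the first
 * call that failed.
 */
int probe_log_lock(const char *path)
{
    struct flock lock = {0};
    int result = 0;
    int fd;

    /* Assumed open mode; the log must be open for writing to write-lock it. */
    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
    {
        return errno;
    }

    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;             /* 0 means "through end of file" */

    /* F_SETLK fails right away where F_SETLKW would block. */
    if (fcntl(fd, F_SETLK, &lock) < 0)
    {
        result = errno;
    }
    else
    {
        /* Lock worked; release it again before closing. */
        lock.l_type = F_UNLCK;
        if (fcntl(fd, F_SETLK, &lock) < 0)
        {
            result = errno;
        }
    }

    close(fd);
    return result;
}
```

A setup script wrapper could call `probe_log_lock` on the configured log path and refuse to continue, printing `strerror` of the result, when it returns nonzero.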
Any feedback on this patched version?
This fix is committed to the 4.2 branch (for 4.2.1), the 4.0 branch (for 4.0.8), and trunk.