Bugzilla – Bug 5467
Condor-G/Jobmanager race results in truncated stdout/err files
Last modified: 2007-08-13 15:55:48
Jaime Frey's analysis of a problem seen on multiple OSG sites that results in a double stage-out of data, with the second copy improperly being zero bytes long:

When stage-in for the job is complete, Condor-G tells the jobmanager to exit. At the same time, the jobmanager learns that the job itself has finished and starts staging the output files back to Condor-G. The jobmanager waits until the stage-out is done before exiting.

In the meantime, Condor-G receives word that the job is complete and tries to restart the jobmanager (creating a new jobmanager process to replace the one that it thinks is now dead). The new jobmanager realizes that the original jobmanager is still alive, says so to Condor-G, and exits. But before exiting, it calls the cache_cleanup perl module callout, which removes job-related files like stdout/err.

Condor-G waits for a minute and restarts the jobmanager again. By this time, the original jobmanager has exited, and the new jobmanager process proceeds. It repeats the stage-out process, but the files have been deleted, so empty files end up being transferred.

I see three problems that should be addressed to fix the lost stdout/err:

1) If a jobmanager that's started to manage an existing job notices that a previous jobmanager for the same job is still alive, it shouldn't delete the job's files.
2) If a jobmanager finishes staging output files successfully and then exits, a new jobmanager for that job shouldn't perform the same transfers again.
3) When Condor-G tells a jobmanager to exit, it should wait at least a few seconds before starting a new jobmanager for that job.
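To make the sequence above concrete, here is a minimal Python sketch of how items 1 and 2 would change the outcome. It is a toy model only, not jobmanager or Condor-G code: JobState, stage_out, start_jobmanager, and replay_race are hypothetical names invented for this illustration, and the job's files are modeled as an in-memory dict.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    """Hypothetical per-job state seen by successive jobmanagers."""
    files: dict = field(default_factory=dict)  # file name -> contents
    old_jm_alive: bool = False                 # previous jobmanager still running?
    stage_out_done: bool = False               # stage-out already completed?


def stage_out(state, received):
    """Copy every job file to the (modeled) Condor-G side, then record success."""
    for name, data in state.files.items():
        received.append((name, data))
    state.stage_out_done = True


def start_jobmanager(state, received):
    """Model one jobmanager restart requested by Condor-G."""
    if state.old_jm_alive:
        # Item 1: a previous jobmanager owns the job, so report that and
        # exit WITHOUT running any cleanup on the job's files.
        return "old jobmanager alive"
    if state.stage_out_done:
        # Item 2: output was already delivered, so do not transfer it again.
        return "already staged out"
    stage_out(state, received)
    return "staged out"


def replay_race():
    """Replay the reported sequence with both guards in place."""
    state = JobState(files={"stdout": "hello\n", "stderr": ""}, old_jm_alive=True)
    received = []
    # Condor-G restarts the jobmanager while the original is still staging out.
    first = start_jobmanager(state, received)
    # The original jobmanager finishes its stage-out and exits.
    stage_out(state, received)
    state.old_jm_alive = False
    # Condor-G retries a minute later.
    second = start_jobmanager(state, received)
    return first, second, received
```

In this model the first restart leaves the files untouched and the second restart transfers nothing, so Condor-G receives exactly one copy of stdout/err with its original contents. Item 3 (the delay on the Condor-G side) is not modeled here.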
A preliminary patch for this issue is on the dev.globus.org wiki:
http://dev.globus.org/images/d/d4/Old-jm-alive.diff
It implements something like option 1 in the bug report description.

joe
Fix committed to CVS.