Bugzilla – Full Text Bug Listing
Summary:        reliable large run job submissions using condor-g
Product:        GRAM            Reporter:  Stuart Martin <email@example.com>
Component:      Campaign        Assignee:  Martin Feller <firstname.lastname@example.org>
Severity:       normal          CC:        email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Bug Depends on: 3121, 4643
Title: Reliable large run job submissions using condor-g

Definition:
Submitting a large number of jobs (2000+) to WS GRAM using condor-g should complete reliably. In recent tests of WS GRAM 4.0.1 at UNL, problems occurred that prevented all 2000+ jobs from completing. A number of modifications were made in 4.0.2 to fix errors and to improve responsiveness to Fork jobs during these large job runs; see bug 4197 for details. However, some errors still prevented all 2000+ jobs from completing. The intent of this campaign is to pick up where 4197 left off and make it so that 2000+ job runs through condor-g complete reliably.

Service host: osg-test1.unl.edu
Client host: osg-test2.unl.edu

Deliverables:
1) Improved WS GRAM code (possibly including other GT components that WS GRAM depends on - RFT, GridFTP, ...)
2) Web page documenting results from successful large test runs, including performance/throughput of the runs.

Tasks:
1) Install latest Condor on service host for use as local RM (Brian Bockelman - UNL)
2) Install latest condor-g on client host (Brian Bockelman - UNL)
3) Install GT 4.0.2 on client and service hosts (Martin Feller - ANL)
4) Submit a series of condor-g-test-io jobs, varying the size of the stage-in/out files per job. This will include a job profile that fits the CRAB application (or the new version of the CRAB application)
5) Analyze/debug/resolve any issues. Write campaigns if larger development is required.
6) Rerun tests until 2000+ job runs complete reliably
Software versions:
osg-test1: GT-4.0.2, Condor-6.7.19
osg-test2: GT-4.0.1, Condor-6.7.19
The Globus version on osg-test2 will not change to 4.0.2, because the changes in the GT from 4.0.1 to 4.0.2 didn't affect globusrun-ws.

Problems and details when installing GT-4.0.2 on osg-test1 and running the first tests:

1) Strange behaviour of Condor: When extremely short jobs (like '/bin/sh -c hostname') with an output option are submitted to Condor using condor_submit (and not the WS-GRAM interface), Condor almost always generates no output file. For jobs that take some time, an output file is created. The same of course happens if GRAM is used to submit jobs to the Condor pool: if you try 'globus-job-run-ws -factory-type Condor -args "-c" "hostname" osg-test1.unl.edu:9443 /bin/sh' as described on http://t2.unl.edu/ws-osg-tests , there will sometimes be no output!

2) globus-job-run-ws and globus-job-run-ws.pl (Perl scripts) are not part of the current GT. I copied them from an old installation. Brian told me that OSG people want to use this script.

3) The Condor job description generated by the GT container results in all jobs staying in state 'pending' in Condor, for two reasons:
a) The Condor job description contains: Requirements = ... Arch == INTEL. But all Intel nodes were claimed by their users :-(
b) There seems to be no shared filesystem between the Condor nodes. Without an additional "when_to_transfer_output = ON_EXIT" in the Condor job description the jobs didn't start running!
I adapted the GRAM Condor Perl module $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/condor.pm with respect to those two things and it worked.

4) I ran the first tests two days ago: 500 jobs without file staging passed. But there were file-staging errors, which didn't affect the success of the jobs.
Maybe it's Condor's fault, because the errors only appeared in the container log if the 'Queue' statement in the Condor job description was followed by a number of jobs > 1. Example:

Universe = grid
Grid_Type = gt4
Jobmanager_Type = Fork
GlobusScheduler = osg-test2.unl.edu:9443
Executable = mysleep
Output = job_sleep.output
Error = job_sleep.error
Log = job_sleep.log
Queue 5

If I queued the job only once, there were no file-staging errors. I need to look at this more closely, but there was no time because of the following item:

5) osg-test1 has NFS problems at the moment. Starting the Condor daemons crashed the whole system. Brian is checking this and will do some basic tests after the problems are solved. Throughput testing can probably continue on Monday.
Here are some updates, in response to Martin's posting last Friday:

1) (No output present on job return): This is due to the jobmanager change you made in (3). The nodes do have a shared filesystem, and the when_to_transfer_output change broke retrieval of output. I rolled back the jobmanager change.

2) (globus-job-run-ws and globus-job-run-ws.pl not in GT): Why aren't these included? They are/were wonderful tools for hand submission and debugging. They are very useful, and I'd recommend considering re-including them.

3) a) "The Condor job description contains: Requirements = ... Arch == INTEL": I have to change this on every condor.pm I install, because we have x86_64 nodes, which support both Archs. Very annoying, and I'd like to see this fixed in condor.pm.
b) "There seems to be no shared filesystem between the Condor nodes": Again, this is not true, and I rolled back the jobmanager change you made.

4) "I ran the first tests two days ago: 500 jobs without file-staging passed...": Please look at this again, now that we have taken care of (3).

5) "osg-test1 has NFS problems at the moment": The NFS problems have been cleared up, and the new NFS server is much more responsive than the old one. I have run through the list of WS tests (with the number of jobs set to 1), and they have all succeeded.
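The fix Brian asks for in (3a) lives in the Perl module condor.pm; as a rough illustration of the intended logic (a Python sketch with hypothetical names, not the real Perl code), the Requirements expression would be built from every architecture the pool offers instead of hard-coding Arch == "INTEL":

```python
def condor_requirements(node_archs):
    """Hypothetical sketch (the real code is Perl, in
    $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/condor.pm):
    build a Requirements line that accepts all pool architectures
    instead of only Arch == "INTEL"."""
    clauses = [f'Arch == "{arch}"' for arch in node_archs]
    return "Requirements = " + " || ".join(clauses)

# For the UNL pool with both Intel and x86_64 nodes:
req = condor_requirements(["INTEL", "X86_64"])
```

The resulting line, `Requirements = Arch == "INTEL" || Arch == "X86_64"`, would let jobs match the x86_64 nodes as well, which support both Archs.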
I updated the validation test scripts, as the condor submit files were using a deprecated syntax. From now on, the scripts can be checked out anonymously from the following SVN repository: svn://t2.unl.edu/cms/ValidationTests. Contact me separately via email if you would like commit access to this repository (or if you have difficulty accessing it). osg-test2 has an updated version of svn, and osg-test1 will soon have one as well. At some point it would be very helpful to automate some of these scripts; this should be possible through some clever parsing of the Condor log files.
The GRAM client tools aren't in the 4.0.x series because they weren't ready for 4.0.0 and only bug fixes are allowed in point releases. They will be included in 4.2.0 as this is a feature release.
General:
########
All tests are submitted from osg-test2.unl.edu with condor_submit. All jobs are managed in the GT4 container (4.0.2) on osg-test1.unl.edu and run in a Condor pool. Each test job has a 4MB stage-in file transfer and an 8MB stage-out file transfer.

Test Results:
#############
Number of tests with 3500 jobs: 10 so far
  success: 5
  aborted: 2
  failed:  3

Reasons for abortion:
---------------------
1) During the first test with 3500 jobs, a lot of java.lang.OutOfMemory exceptions appeared in the container logfile. The default configuration here is that the container starts with 512MB of RAM available; obviously this is not enough for 3500 jobs. In the following tests the container was started with 1024MB of available RAM and there were no more OutOfMemory exceptions.
2) The gridftp-server on osg-test2 could not be contacted anymore on port 2811 during the test.

Reasons for failure:
--------------------
3 times: 1 or 2 jobs of 3500 remained in internal state StageInResponse and did not continue. Those jobs still existed on disk in their persisted version and thus seemed to be alive. There were no errors in the container logfile concerning these jobs. All other jobs finished successfully.
1 time: 1 job submitted by condor-g did not reach the container (3500 jobs submitted by condor_submit, 3499 jobs accepted by the container). Unfortunately the file that logs the job states on the client side was damaged, so I could not examine the reason for the loss of this job.

Possible reason for jobs sometimes staying in state StageInResponse:
--------------------------------------------------------------------
A job in state StageInResponse will continue only if a notification from RFT indicates that staging has finished. If this notification is lost, the resource will not be added to the appropriate RunQueue again and stays in this state.
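The suspected failure mode, and one possible mitigation, can be sketched as a toy model (Python, hypothetical names; the real GT classes are Java and behave as described above, not as written here):

```python
import queue
import time

# Toy model of the hypothesis: a job in StageInResponse is only re-enqueued
# when the RFT notification arrives, so a lost notification strands it.

class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = "StageInResponse"
        self.entered_state = time.monotonic()

def on_rft_notification(job, run_queue):
    # Normal path: the listener records the message and re-enqueues the
    # resource so a RunThread can process it.
    job.state = "StageInResponseReceived"
    run_queue.put(job)

def requeue_stranded_jobs(jobs, run_queue, timeout):
    # Possible mitigation (not in the 4.0.2 code): after a timeout,
    # re-enqueue the job and let the state machine poll RFT directly
    # instead of waiting for a notification that may never come.
    for job in jobs:
        waited = time.monotonic() - job.entered_state
        if job.state == "StageInResponse" and waited > timeout:
            run_queue.put(job)
```

If notifications can indeed be lost, some timeout-based fallback along these lines would keep a handful of jobs out of 3500 from hanging forever.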
Job State Statistics:
#####################
The "busiest" RunQueue is the stageIn queue: when all 3500 jobs are accepted by the container, up to 1783 jobs are in internal state stageIn.

-------------------------------------------------------------------
2006-06-30 09:06:46
-------------------------------------------------------------------
JOB STATES              | #JOBS IN THAT STATE
-------------------------------------------------------------------
Done:
Submit:
OpenStderr:
FileCleanUpResponse:
WaitingForStateChanges:
Start:
OpenStdout:
CleanUp:
StageInResponse:
StageOutResponse:
StageOut:
MergeStdout:
StageIn:
-------------------------------------------------------------------
TOTAL #JOBS: 3500
-------------------------------------------------------------------

Experiments with more RunThreads for the stageIn queue (5 instead of 3) and higher priority for the RunThreads of this queue resulted in a maximum of 1024 jobs in this queue, but did NOT result in better performance. If the stageIn RunQueue is "full", jobs stay there for hours before they get processed by a RunThread.

Conclusion and open questions:
##############################
* Tests take a lot of time. I don't know yet how this could be improved.
* Critical issue: jobs staying in internal state StageInResponse. I have to verify that it's really the notification loss. If so: notifications are not reliable, are they? But this should not be a problem of the big number of jobs; it could also occur with a lower number of jobs.
* There seems to be no general problem with 3500 jobs as long as enough memory is provided for the container process. Nothing exceptional seems to happen due to the number of jobs, except for the jobs staying in state StageInResponse.
* Why is the stageIn RunQueue the busiest, and why does it take hours until a resource in state stageIn is processed by a RunThread if the RunQueue is full? Where is the difference to the stageOut RunQueue, which is not as busy?
Is it just the number of jobs coming in at the same time?

Test details:
#############

2000 Jobs:
++++++++++
Test1:
------
Start: 2006-06-26 14:59:59 (first job in internal state 'None')
End:   2006-06-26 19:47:24
State: no errors

2500 Jobs:
++++++++++
Test1:
------
Start: 2006-06-27 17:05:23 (first job in internal state 'None')
End:   2006-06-27 23:25:03
State: no errors

3500 Jobs:
++++++++++
Test1:
------
Result: test was aborted by me
Reason for abortion: a lot of java.lang.OutOfMemory errors in the container logfile
Start:   2006-06-27 03:08:52 (first job in internal state 'None')
Aborted: 2006-06-27 16:44:21
=> in the following tests the container was started with 1024 MB RAM

Test2:
------
Result: no errors
Start: 2006-06-28 01:46:30 (first job in internal state 'None')
End:   2006-06-28 10:53:21

Test3:
------
Result: 2 jobs stayed in internal state "StageInResponse" without change
Start: 2006-06-29 10:25:34 (first job in internal state 'None')
End:   2006-06-29 19:09:54

Test4:
------
Result: no errors
Start: 2006-06-30 05:01:15 (first job in internal state 'None')
End:   2006-06-30 13:40:15

Test5:
------
Modifications: 5 RunThreads for the stageIn queue (instead of 3), each with highest priority
Result: no errors
Start: 2006-07-01 05:26:17 (first job in internal state 'None')
End:   2006-07-01 14:21:06

Test6:
------
Modifications: 5 RunThreads for the stageIn queue (instead of 3), each with highest priority
Result: 2 jobs stayed in internal state "StageInResponse" without change
Start: 2006-07-02 02:51:42 (first job in internal state 'None')
End:   2006-07-02 11:43:49

Test7:
------
Modifications: 5 RunThreads for the stageIn queue (instead of 3), each with highest priority
Result: test was aborted by me
Reason for abortion: connection to the RFT service in the container on osg-test2 failed many times
Start:   2006-07-03 02:26:01 (first job in internal state 'None')
Aborted: 2006-07-03 09:30:34

Test8:
------
Modifications: 5 RunThreads for the stageIn queue (instead of 3), each with highest priority
Result: no errors
Start: 2006-07-03 12:34:36 (first job in internal state 'None')
End:   2006-07-03 21:54:26

Test9:
------
Modifications: 5 RunThreads for the stageIn queue (instead of 3), each with highest priority
Result: no errors
Start: 2006-07-04 01:05:44 (first job in internal state 'None')
End:   2006-07-04 10:22:10

Test10:
-------
Modifications: 5 RunThreads for the stageIn queue (instead of 3), each with highest priority
Result: 3499 jobs were accepted by the container (1 was lost); 1 job stayed in internal state "StageInResponse" without change
Start: 2006-07-04 10:48:09 (first job in internal state 'None')
End:   2006-07-04 20:07:15
An error that occurs quite often:
################################
ERROR service.TransferClient [Thread-47,normalNonExtendedTransfer:871] Exception in transfer ...
500-Unable to open file /home/gpn/.globus/scratch/\
job_2a947a70-0a65-11db-a8e5-fd4d3de5aa81/mysleep

This kind of error occurs very often (but only during stageIn) and does not impact the success of a job. I think the reason is the following: I had a look at the RSL created by condor-g and found that there are four transfer parts inside the fileStageIn:

...
<fileStageIn>
  <maxAttempts>5</maxAttempts>
  <transferCredentialEndpoint> ... </transferCredentialEndpoint>
  <transfer>
    <sourceUrl>gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776/empty_dir_u1465/</sourceUrl>
    <destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch</destinationUrl>
  </transfer>
  <transfer>
    <sourceUrl>gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776/empty_dir_u1465/</sourceUrl>
    <destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/</destinationUrl>
  </transfer>
  <transfer>
    <sourceUrl>gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/3500_jobs_2006_07_06_Mxm1024M/mysleep</sourceUrl>
    <destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/mysleep</destinationUrl>
  </transfer>
  <transfer>
    <sourceUrl>gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/3500_jobs_2006_07_06_Mxm1024M/test_input</sourceUrl>
    <destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/test_input</destinationUrl>
  </transfer>
</fileStageIn>
...

I assume the error is caused by the following: if the first two transfers are not finished when the third one starts, the directory job_8ff6c880-0cc4-11db-b248-a9807d8bba43 (created during transfer two) does not exist yet, and this results in the above error.
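The suspected race can be sketched as follows (a minimal Python stand-in, not the real gsiftp/RFT code; names are illustrative): transfer 3 writes into the job directory that transfer 2 is supposed to create, so if the transfers are not executed in order, the destination directory is missing and the server answers with the 500 error.

```python
import os
import tempfile

def gsiftp_put(data, dest_path):
    # Stand-in for a single gsiftp transfer: like the real server, it
    # cannot open a file in a directory that does not exist.
    if not os.path.isdir(os.path.dirname(dest_path)):
        raise FileNotFoundError(f"500-Unable to open file {dest_path}")
    with open(dest_path, "w") as f:
        f.write(data)

def stage_in_ordered(scratch, job_id, files):
    # Keeping the strict order of the <transfer> elements avoids the race:
    job_dir = os.path.join(scratch, job_id)
    os.makedirs(job_dir, exist_ok=True)   # transfers 1 and 2 create the dirs
    for name, data in files.items():      # transfers 3 and 4 fill them
        gsiftp_put(data, os.path.join(job_dir, name))
    return job_dir
```

Running the directory-creating transfers strictly before the file transfers succeeds every time; starting a file transfer before the job directory exists reproduces the "Unable to open file" failure.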
The same error sometimes occurs in transfer 4, but much less frequently. I'm still not quite sure whether this can be the reason for the few jobs staying in state stageInResponse, but I don't think so: a notification about a failed stageIn (like one about a successful stageIn) will be delivered to the StagingListener, which saves the notification messages and adds the resource to the RunQueue again. The resource will then be processed by StateMachine.processStageInResponseState(), where the notification messages of the staging are processed. If there was an error during stageIn, the job will be cancelled. So a job should fail when an error occurs during stageIn (like maxAttempts reached), but should not stay in state stageInResponse forever. Peter or somebody else: correct me if something is wrong with these thoughts!

Possible reason why stageIn takes longer than stageOut with Condor-G:
#####################################################################
In the RSL created by Condor-G, fileStageOut consists of only 3 transfers (compared to 4 in fileStageIn). This would explain why stageIn takes longer than stageOut. Also, there is a holdState element in the RSL created by Condor-G which is set to StageIn. I don't know yet what this is for.

Fewer jobs appeared in the container log than had been submitted by Condor:
##########################################################################
During the last two tests, 3500 jobs were submitted by Condor-G but only 3499 jobs appeared in the logfile. I assume the reason for this lies in Condor: as can be seen from the above RSL snippet, the Globus job ID is already part of the RSL. AFAIK this can only be the case if the job ID itself is created before submission to the container and is added to the CreateManagedJobInputType submitted to the createManagedJob method of the MEJFS.
I saw in the Condor logfile of the 3500 jobs on the client side that 2 Condor jobs were mapped to the same Globus job ID (d16256f0-0cd2-11db-a8a8-b3868d4ecca5):

017 (172.316.000) 07/06 08:36:55 Job submitted to Globus
    RM-Contact: osg-test1.unl.edu:9443
    JM-Contact: https://172.16.149.235:9443/wsrf/services/\
ManagedExecutableJobService?d16256f0-0cd2-11db-a8a8-b3868d4ecca5
    Can-Restart-JM: 0
...
017 (172.325.000) 07/06 08:40:10 Job submitted to Globus
    RM-Contact: osg-test1.unl.edu:9443
    JM-Contact: https://172.16.149.235:9443/wsrf/services/\
ManagedExecutableJobService?d16256f0-0cd2-11db-a8a8-b3868d4ecca5
    Can-Restart-JM: 0
...

A job request with an already existing Globus job ID will not result in the creation of a job in the container. So in this case, one job request did not result in a Globus job. This case is not logged by the container, not even in DEBUG mode. In the second test that failed this way, there was also a mapping of two job requests to the same Globus job ID. In both tests, the remaining 3499 jobs finished without errors and without hanging in some state.

Conclusion:
###########
No real answers to the problems yet, but some explanations for some ERROR messages and for jobs not reaching the container. I posted some of these things to the Condor mailing list but haven't gotten an answer so far.
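The behaviour described above can be modeled in a few lines (a Python toy, hypothetical names, not the real MEJFS API): the client pre-generates the job ID, and a request carrying an already existing ID silently creates no job.

```python
import uuid

class Container:
    """Toy model of the createManagedJob behaviour described above."""
    def __init__(self):
        self.jobs = {}

    def create_managed_job(self, job_id, rsl):
        if job_id in self.jobs:
            return None           # duplicate ID: nothing created, nothing logged
        self.jobs[job_id] = rsl
        return job_id

def submit(container, rsl, job_id=None):
    # Condor-G generates the UUID on the client side before submission;
    # a collision-free generator (e.g. uuid.uuid4()) would avoid the
    # observed loss of one job out of 3500.
    return container.create_managed_job(job_id or str(uuid.uuid4()), rsl)
```

With two requests carrying the same ID, only the first results in a job; the second vanishes without a trace, matching the 3500-submitted / 3499-accepted counts.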
> Also, there is a holdState element in the RSL created by Condor-G which
> is set to StageIn. I don't know so far what this is for.

Condor-G likes to do two-stage commit submissions. I can't remember the reasoning for it, but putting a hold on fileStageIn means the job will not start doing anything significant until Condor-G comes back with a release call.
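In other words (a toy Python model with hypothetical names, not the real GRAM state machine): with holdState set to StageIn, the job parks in a hold state after submission and only proceeds once the client issues the release.

```python
class GramJob:
    """Toy model of the two-stage commit described above."""
    def __init__(self, hold_state=None):
        self.hold_state = hold_state
        self.state = "Submitted"

    def start(self):
        if self.hold_state == "StageIn":
            self.state = "StageInHold"   # nothing significant happens yet
        else:
            self.state = "StageIn"       # staging begins immediately

    def release(self):
        if self.state == "StageInHold":
            self.state = "StageIn"       # client committed; staging may begin
```

A job whose release never arrives simply stays in the hold state, which is consistent with the occasional stuck-in-stageInHold jobs reported later in this thread.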
Here's what I've noticed in the test last night:

1) The names of the GRAM threads are not unique. That makes it hard to see what's going on across multiple threads of the same type.

2) At the end of the test, there are a few jobs that are stuck in ManagedExecutableJobResource.remove(). They have been stuck there for hours. Whatever the reason they are stuck, they should not loop forever; there should be some maximum timeout. The problem is that while these jobs are looping like that, they are actually using a container thread, and that container thread is never returned to the pool (which affects the container's ability to serve other requests).

3) If all container threads are stuck as in 2), then, since the container does not limit the maximum number of connections, a large request queue can eventually build up. Each request is one file descriptor, and eventually all available file descriptors might be used. That could explain all (or most of) the other errors in the container that we see. So one thing to try is to increase the number of file descriptors available to the container and see how things behave.
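The fix suggested in (2) amounts to bounding the retry loop. A minimal sketch (Python, hypothetical helper, not GT code):

```python
import time

def remove_with_deadline(do_remove, max_wait, interval=0.0):
    """Retry the remove operation, but give up after a deadline so the
    container thread goes back to the pool instead of looping forever.
    max_wait and interval are in seconds; values are illustrative."""
    deadline = time.monotonic() + max_wait
    while True:
        try:
            return do_remove()
        except Exception as err:
            if time.monotonic() >= deadline:
                # Surface the failure instead of silently spinning; the
                # thread is released either way.
                raise TimeoutError("remove() gave up after deadline") from err
            time.sleep(interval)
```

Even if the underlying resource can never be removed, the worst case is now one bounded delay per stuck job rather than a permanently consumed container thread.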
Some information about the latest tests: we had a period of about 20 tests with 3500 jobs with problems. The problems varied in nature but can be divided into two categories:

1) File-staging errors (harmless but undesirable): We finally found out that these errors occur because of dependent transfers in the RSL created by Condor-G. RFT didn't keep the transfers in strict order, so some files were to be placed in a directory that didn't exist yet (because it was to be created by another transfer). In a second or third try, all transfers within a job always succeeded eventually. Ravi fixed this some days ago, so we don't have to care about this anymore.

2) A mixture of errors that impacted the success of tests:

ERROR container.GSIServiceThread [ServiceThread-N,process:145] Error processing request
java.net.SocketException: Connection reset
...
ERROR exec.StateMachine [RunQueue Other,processCompletedStagingJob:2729] Unable to destroy transfer. AxisFault
... faultString: Proxy file (/tmp/x509up_u2) not found.
...
ERROR securemsg.X509SignHandler [RunQueue Other,handleMessage:109] Failed to sign message
org.globus.gsi.GlobusCredentialException: Proxy file (/tmp/x509up_u2) not found.
...
ERROR container.ServiceDispatcher [ServiceDispacherThread-N,run:230] Error accepting new connection
java.net.SocketException: Too many open files
...

We switched from the 4.0.2 code to the GRAM 4.0.branch code, and from that point on those problems disappeared. Only some problems remained:
a) Some jobs share the same UUID (set on the client side by Condor-G using axis.jar).
b) Sometimes jobs don't seem to get released from stageInHold by Condor-G (around 1 job per 7000 jobs). At least sometimes a job doesn't continue from stageInHold.

Some days ago Jaime Frey provided a new gridmanager which does not add a holdState element to the RSL. The last tests were done with this gridmanager without failures.
Well, almost without failures, because network problems occurred during one test: the gridftp-server on osg-test2 wasn't reachable for some time during that test. We expected performance improvements in the test runs because of the missing holdState, but there is no significant improvement.
The latest tests ran stably. Altogether 30 tests succeeded without errors, or with only small errors that are not caused by the GT container (see below). An average test takes about 11 hours to finish for 3500 jobs with a runtime of a few seconds each and with 2MB of input data and 2MB of output data per job.

Relevant information for jobs running reliably:

1) Start the container process with enough memory: 512 MB is not enough (there will be OutOfMemory exceptions); 1024 MB is ok. This can be done by setting $GLOBUS_OPTIONS="-Xmx1024M". Due to the long duration of a test run (about 11 hours for 3500 jobs), the exact limit was not tested.
2) Use the GT 4.0.branch code (WS-GRAM and RFT).
3) If jobs are submitted through Condor-G, make sure to create a proxy certificate that's valid for just 12 hours, even if the test takes more time (see below).

Still existing problems:

1) Quite rare: two jobs share the same UUID. The UUID is generated on the client side by Condor-G, which uses Axis to generate it. Somehow the algorithm used does not always generate unique IDs. This results in one job not being accepted by the container if those jobs overlap in time.
2) Rare: At the start of Condor's gridmanager daemon, an empty directory is created on the client side in /tmp/condor_g_scratch.0x*.*, which is used to create directories for jobs on the server via RFT. Sometimes there are errors which indicate that this directory was removed on the client side. We're sure that it's not a WS-GRAM/RFT error, because this directory exists only on the client side, but we cannot figure out exactly why this directory is removed.
3) We found during performance tests that tests with a duration longer than 12 hours cause problems with credentials, and that Condor causes these problems. Jaime will fix this. A workaround in the meantime is to generate a proxy with a validity of 12 hours, even if the test takes longer. In such a case Condor will refresh the proxy before it expires.
4) Very rare: A few jobs seem not to be released from holdState by Condor. Jaime provided a gridmanager that does not add a holdState to RSLs, and it works well.