Bug 4506

Summary: reliable large run job submissions using condor-g
Product: GRAM
Reporter: Stuart Martin <smartin@mcs.anl.gov>
Component: Campaign
Assignee: Martin Feller <feller@mcs.anl.gov>
Status: CLOSED FIXED
Severity: normal
CC: bbockelm@math.unl.edu, bester@mcs.anl.gov, childers@mcs.anl.gov, feller@mcs.anl.gov, gawor@mcs.anl.gov, lane@mcs.anl.gov, madduri@mcs.anl.gov, ranantha@mcs.anl.gov, rohder@mcs.anl.gov, smartin@mcs.anl.gov
Priority: P3
Version: 4.0.2
Target Milestone: 4.0.3
Hardware: Macintosh
OS: All
Bug Depends on: 3121, 4643
Bug Blocks: 4050

Description From 2006-06-12 10:08:17
Title:

Reliable large run job submissions using condor-g

Definition:

Submitting a large number of jobs (2000+) to WS GRAM using condor-g should
complete reliably.  In recent tests of WS GRAM 4.0.1 at UNL, problems occurred
that prevented all 2000+ jobs from completing.  A number of modifications were
made in 4.0.2 to fix errors as well as improve responsiveness to Fork jobs
during these large job runs.  See bug 4197 for details.  However, there were
still some errors that prevented all 2000+ jobs from completing.  The intent of
this campaign is to pick up where 4197 left off and make it so that 2000+ job
runs through condor-g will complete reliably.

Service host: osg-test1.unl.edu
Client host: osg-test2.unl.edu

Deliverables:

1) Improved WS GRAM code (possibly including other GT components that WS GRAM
depends on - RFT, GridFTP, ...)
2) Web page documenting results from successful large test runs, including
performance/throughput of the runs.

Tasks:

1) Install latest condor on service host for use as local RM (Brian Bockelman -
UNL)
2) Install latest condor-g on client host (Brian Bockelman - UNL)
3) Install GT 4.0.2 on client and service hosts (Martin Feller - ANL)
4) Submit a series of condor-g-test-io jobs, varying the size of the stage
in/out files per job.  This will include a job profile that fits the CRAB
application (or the new version of the CRAB application); an example submit
description is sketched just after this task list.
5) Analyze/debug/resolve any issues.  Write campaigns if larger development is
required.
6) Rerun tests until 2000+ job runs complete reliably
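
For illustration, a Condor-G submit description for task 4 might look roughly
like the sketch below.  This is a hedged example only: the host name,
executable name, file names, sizes, and job count are placeholders, the
keywords use the older Grid_Type/GlobusScheduler syntax rather than the newer
grid_resource form, and whether a given Condor-G version maps
transfer_input_files/transfer_output_files into the RSL staging sections may
depend on the Condor version.

     Universe              = grid
     Grid_Type             = gt4
     Jobmanager_Type       = Condor
     GlobusScheduler       = osg-test1.unl.edu:9443
     Executable            = condor-g-test-io
     # illustrative stage-in/stage-out files; vary their sizes per task 4
     Transfer_Input_Files  = test_input
     Transfer_Output_Files = test_output
     Output                = job_$(Cluster).$(Process).output
     Error                 = job_$(Cluster).$(Process).error
     Log                   = job.log
     Queue 2000
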
------- Comment #1 From 2006-06-16 10:33:21 -------
Software versions:

osg-test1: GT-4.0.2, Condor-6.7.19
osg-test2: GT-4.0.1, Condor-6.7.19

The globus version on osg-test2 will not change to 4.0.2, because the
changes in the GT from 4.0.1 to 4.0.2 didn't affect globusrun-ws.

Problems and details when installing GT-4.0.2 on osg-test1 and running the
first tests:

1) Strange behaviour of Condor: when extremely short jobs
   (like '/bin/sh -c hostname') with an output option are submitted to Condor
   using condor_submit (and not the WS-GRAM interface), Condor almost never
   generates an output file.
   For jobs that take some time, an output file is created.
   The same happens, of course, if GRAM is used to submit jobs to the
   Condor pool:
   so if you try 'globus-job-run-ws -factory-type Condor -args "-c" "hostname"
   osg-test1.unl.edu:9443 /bin/sh'
   as described on http://t2.unl.edu/ws-osg-tests , there will sometimes be no
   output!
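
   For reference, a minimal condor_submit description of the kind used to
   reproduce this directly (a sketch; file names are illustrative) would
   look like:

     Universe   = vanilla
     Executable = /bin/sh
     Arguments  = -c hostname
     Output     = hostname.output
     Error      = hostname.error
     Log        = hostname.log
     Queue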

2) globus-job-run-ws and globus-job-run-ws.pl (Perl scripts) are not part of
   the current GT.
   I copied them from an old installation.
   Brian told me that OSG people want to use these scripts.

3) The Condor job description generated by the GT container results in all
   jobs staying in state 'pending' in Condor, for 2 reasons:
   a) The Condor job description contains: Requirements = ... Arch == INTEL
      But all Intel nodes were claimed by their users :-(
   b) There seems to be no shared filesystem between the Condor nodes.
      Without an additional "when_to_transfer_output = ON_EXIT" in the Condor
      job description, the jobs didn't start running!

   I adapted the GRAM Condor Perl module
   $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/condor.pm
   with respect to those two things and it worked.
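
   For illustration, the net effect of that change on the generated Condor
   description is roughly the following sketch (not the literal condor.pm
   diff; the Requirements expression shown is just an example):

     # Requirements no longer forces Arch == "INTEL", e.g.:
     Requirements            = (OpSys == "LINUX")
     # added so jobs start without relying on a shared filesystem:
     when_to_transfer_output = ON_EXIT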

4) I ran the first tests two days ago: 500 jobs without file-staging passed.
   But there were file-staging errors which didn't affect the success of the
   jobs.
   Maybe it's the fault of Condor, because the errors only appeared in the
   container log if the 'Queue' statement in the Condor job description was
   followed by a number of jobs > 1.
   Example:

     Universe        = grid
     Grid_Type       = gt4
     Jobmanager_Type = Fork
     GlobusScheduler = osg-test2.unl.edu:9443
     Executable      = mysleep
     Output          = job_sleep.output
     Error           = job_sleep.error
     Log             = job_sleep.log
     Queue 5

   If I queued the job only once, there were no file-staging errors.
   I need to look at this more closely, but there was no time because of the
   following item:

5) osg-test1 has NFS problems at the moment.
   Starting the condor daemons crashed the whole system.
   Brian is checking this and will do some basic tests after the problems are
   solved.
   Throughput testing can probably continue on Monday.
------- Comment #2 From 2006-06-20 14:51:14 -------
Here are some updates, in response to Martin's posting last Friday:

1) (No output present on job return): This is due to the jobmanager change you
made in (3). The nodes do have a shared file system, and the
when_to_transfer_output change broke retrieval of output. I rolled back the
jobmanager change.

2) (globus-job-run-ws and globus-job-run-ws.pl not in GT): Why aren't these
included? They are/were wonderful tools for hand submission and debugging.
They are very useful, and I'd recommend considering re-including them.

3) a) The Condor job description contains: Requirements = ... Arch == INTEL

   I have to change this in every condor.pm I install because we have x86_64
nodes, which support both architectures. Very annoying, and I'd like to see
this fixed in condor.pm.

   b) There seems to be no shared filesystem between the Condor nodes.

   Again, this is not true, and I rolled back the jobmanager change you made.   

4) I ran the first tests two days ago: 500 jobs without file-staging passed...

  Please look at this again, now that we have taken care of (3).

5) osg-test1 has NFS problems at the moment:

  The NFS problems have been cleared up, and the new NFS server is much more
responsive than the old one.

I have run through the list of WS tests (with the number of jobs set to 1), and
they have all succeeded.
------- Comment #3 From 2006-06-20 15:04:16 -------
I updated the validation test scripts, as the condor submit files were using a
deprecated syntax.  From now on, the scripts can be checked out anonymously
from the following SVN repository:

svn://t2.unl.edu/cms/ValidationTests

Contact me separately via email if you would like commit access to this
repository (or if you have difficulty accessing it).  osg-test2 has an updated
version of svn, and osg-test1 will soon have an updated svn.  

At some point, it would be very helpful if we were able to automate some of
these scripts - this should be possible through some clever parsing of the
condor log files.
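
As a first step in that direction, a sketch like the following could count
submitted vs. finished jobs, assuming the standard Condor user log format
(event code 017 is "Job submitted to Globus", 005 is "Job terminated"); the
log file name is illustrative:

  grep -c '^017 ' job_sleep.log
  grep -c '^005 ' job_sleep.log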
------- Comment #4 From 2006-06-20 15:18:20 -------
The GRAM client tools aren't in the 4.0.x series because they weren't ready for
4.0.0 and only bug fixes are allowed in point releases. They will be included
in 4.2.0 as this is a feature release.
------- Comment #5 From 2006-07-05 03:59:21 -------
General:
########

All tests are submitted from osg-test2.unl.edu with condor_submit.
All jobs are managed in the GT4-Container (4.0.2) on osg-test1.unl.edu
and run in a Condor pool.

Number of tests with 3500 jobs: 10 so far

Each test job has
  4MB stage-in file transfer
  8MB stage-out file transfer



Test Results:
#############

Number of tests with 3500 jobs: 10 so far
success: 5
aborted: 2
failed:  3

Reasons for aborting:
---------------------
1) During the first test with 3500 jobs, a lot of java.lang.OutOfMemory
   exceptions appeared in the container logfile.
   The default configuration here is that the container starts with
   512MB of available RAM. Obviously this is not enough for 3500 jobs.
   In the following tests the container was started with 1024MB of available
   RAM and there were no more OutOfMemory exceptions.
2) The gridftp-server on osg-test2 could no longer be contacted on port 2811
   during the test.


Reasons for failure:
--------------------
3 times: 1 or 2 jobs out of 3500 remained in internal state StageInResponse
         and did not continue.
         Those jobs still existed on disk in persisted form and thus
         seemed to be alive.
         There were no errors in the container logfile concerning these jobs.
         All other jobs finished successfully.
1 time:  1 job submitted by condor-g did not reach the container
         (3500 jobs submitted by condor_submit, 3499 jobs accepted by the
         container). Unfortunately the file that logs the job states on the
         client side was damaged, so I could not determine the reason for the
         loss of this job.


Possible reason for jobs staying in state StageInResponse sometimes:
--------------------------------------------------------------------
A job in state StageInResponse will continue only if a notification from
RFT indicates that staging has finished. If this notification is lost, the
resource is not added to the appropriate RunQueue again and remains in
this state.




Job State Statistics:
#####################

The "busiest" RunQueue is the stageIn-Queue:
once all 3500 jobs have been accepted by the container, up to 1783 jobs are
in internal state stageIn.

-------------------------------------------------------------------
2006-06-30 09:06:46
-------------------------------------------------------------------
            JOB STATES | #JOBS IN THAT STATE
-------------------------------------------------------------------
                  Done: [1542]
                Submit:    [1]
            OpenStderr:   [10]
   FileCleanUpResponse:   [22]
WaitingForStateChanges:   [31]
                 Start:   [13]
            OpenStdout:    [2]
               CleanUp:   [18]
       StageInResponse:   [36]
      StageOutResponse:   [16]
              StageOut:   [10]
           MergeStdout:   [16]
               StageIn: [1783]
-------------------------------------------------------------------
           TOTAL #JOBS: 3500
-------------------------------------------------------------------


Experiments with more RunThreads for the stageIn-Queue (5 instead of 3)
and a higher priority for the RunThreads of this queue resulted in a maximum
of 1024 jobs in this queue but did NOT result in better performance.

If the stageIn-RunQueue is "full", jobs stay there for hours before they
are processed by a RunThread.



Conclusion and open questions:
##############################

* Tests take a lot of time. I don't know yet how this could be improved.
* Critical issue:
    Jobs staying in internal state StageInResponse.
    I have to verify that it's really the notification loss.
    If so: notifications are not reliable, are they?
    But this should not be a problem specific to the large number of jobs;
    it could also occur with a lower number of jobs.
* There seems to be no general problem with 3500 jobs as long as enough
  memory is provided for the container process.
  Nothing exceptional seems to happen due to the number of jobs except for
  the jobs staying in state StageInResponse.
* Why is the stageIn-RunQueue the busiest, and why does it take hours
  until a resource in state stageIn is processed by a RunThread
  if the RunQueue is full?
  How does it differ from the stageOut-RunQueue, which is not as busy?
  Is it just the number of jobs coming in at the same time?




Test details:
#############


2000 Jobs:
++++++++++

Test1:
------
Start: 2006-06-26 14:59:59 (first job in internal state 'None')
End:   2006-06-26 19:47:24
State: no errors




2500 Jobs:
++++++++++

Test1:
------
Start:    2006-06-27 17:05:23 (first job in internal state 'None')
End:    2006-06-27 23:25:03
State:    no errors




3500 Jobs
++++++++++

Test1:
------
Result: Test was aborted by me
    Reason for aborting:
        a lot of java.lang.OutOfMemory errors in the
        container logfile
Start:     2006-06-27 03:08:52 (first job in internal state 'None')
Aborted: 2006-06-27 16:44:21


=> in the following tests the container was started with 1024 MB RAM 


Test2:
------
Result:    no errors
Start:    2006-06-28 01:46:30 (first job in internal state 'None')
End:    2006-06-28 10:53:21


Test3:
------
Result:    2 jobs stayed in internal state "StageInResponse" without change
Start:    2006-06-29 10:25:34 (first job in internal state 'None')
End:    2006-06-29 19:09:54


Test4:
------
Result:    no errors
Start:    2006-06-30 05:01:15 (first job in internal state 'None')
End:    2006-06-30 13:40:15


Test5:
------
Modifications:
    5 RunThreads for the stageIn-Queue (instead of 3).
    Each RunThread for the stageIn-Queue runs with highest priority.
Result:    no errors
Start:    2006-07-01 05:26:17 (first job in internal state 'None')
End:    2006-07-01 14:21:06


Test6:
------
Modifications:
    5 RunThreads for the stageIn-Queue (instead of 3).
    Each RunThread for the stageIn-Queue runs with highest priority.
Result:    2 jobs stayed in internal state "StageInResponse" without change
Start:    2006-07-02 02:51:42 (first job in internal state 'None')
End:    2006-07-02 11:43:49


Test7:
------
Modifications:
    5 RunThreads for the stageIn-Queue (instead of 3).
    Each RunThread for the stageIn-Queue runs with highest priority.
Result: test was aborted by me
Reason for aborting:
    Connection to RFT-Service in Container on osg-test2 failed
    many times
Start:     2006-07-03 02:26:01 (first job in internal state 'None')
Aborted: 2006-07-03 09:30:34


Test8:
------
Modifications:
    5 RunThreads for the stageIn-Queue (instead of 3).
    Each RunThread for the stageIn-Queue runs with highest priority.
Result:    no errors
Start:    2006-07-03 12:34:36 (first job in internal state 'None')
End:    2006-07-03 21:54:26


Test9:
------
Modifications:
    5 RunThreads for the stageIn-Queue (instead of 3).
    Each RunThread for the stageIn-Queue runs with highest priority.
Result:    no errors
Start:    2006-07-04 01:05:44 (first job in internal state 'None')
End:    2006-07-04 10:22:10


Test10:
-------
Modifications:
    5 RunThreads for the stageIn-Queue (instead of 3).
    Each RunThread for the stageIn-Queue runs with highest priority.
Result: 3499 jobs were accepted by the container (1 was lost)
    1 job stayed in internal state "StageInResponse" without change
Start:    2006-07-04 10:48:09 (first job in internal state 'None')
End:    2006-07-04 20:07:15
------- Comment #6 From 2006-07-07 09:00:22 -------
an error that occurs quite often
################################

 ERROR service.TransferClient [Thread-47,normalNonExtendedTransfer:871]
 Exception in transfer
 ...
 500-Unable to open file /home/gpn/.globus/scratch/\
 job_2a947a70-0a65-11db-a8e5-fd4d3de5aa81/mysleep


 This kind of error occurs very often (but only during stageIn), but it does
 not impact the success of a job.
 I think the reason is the following:
 I had a look at the RSL created by condor-g and found that there are
 four transfer elements inside the fileStageIn:

...    
<fileStageIn>
  <maxAttempts>5</maxAttempts>
  <transferCredentialEndpoint>
      ...
  </transferCredentialEndpoint>

  <transfer>
      <sourceUrl>
        gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776\
             /empty_dir_u1465/
      </sourceUrl>
      <destinationUrl>
        gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch
      </destinationUrl>
  </transfer>

  <transfer>
      <sourceUrl>
        gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776\
             /empty_dir_u1465/</sourceUrl>
      <destinationUrl>
        gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/\
             job_8ff6c880-0cc4-11db-b248-a9807d8bba43/
      </destinationUrl>
  </transfer>

  <transfer>
      <sourceUrl>
        gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/\
              3500_jobs_2006_07_06_Mxm1024M/mysleep
      </sourceUrl>
      <destinationUrl>
        gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/\
             job_8ff6c880-0cc4-11db-b248-a9807d8bba43/mysleep
      </destinationUrl>
  </transfer>

  <transfer>
      <sourceUrl>
        gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/\
                 3500_jobs_2006_07_06_Mxm1024M/test_input
      </sourceUrl>
      <destinationUrl>
        gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/\
              job_8ff6c880-0cc4-11db-b248-a9807d8bba43/test_input
     </destinationUrl>
  </transfer>

</fileStageIn>
...

 I assume the error is caused by the following:
 if the first two transfers have not finished when the third one starts,
 the directory job_8ff6c880-0cc4-11db-b248-a9807d8bba43 (created during
 transfer two) does not yet exist, and this results in the above error.
 The same error sometimes occurs in transfer 4, but much less frequently.

 I'm still not quite sure whether this can be the reason for the few jobs
 staying in state stageInResponse, but I don't think so:
 A notification of a failed stageIn (like that of a successful stageIn) will
 be delivered to the StagingListener, which saves the notification messages
 and adds the Resource to the RunQueue again.
 The resource will then be processed by
 StateMachine.processStageInResponseState(),
 where the notification messages of the staging are processed. In case
 there was an error during stageIn, the job will be cancelled.

 So a job should fail when an error occurs during stageIn (like maxAttempts
 reached), but it should not stay in state stageInResponse forever.

 Peter or somebody else: correct me if something is wrong with these thoughts!


 Possible reason why stageIn takes longer than stageOut with Condor-G:
 #####################################################################

 In the RSL created by Condor-G, fileStageOut consists of only 3 transfers
 (compared to 4 for fileStageIn).
 This would explain why stageIn takes longer than stageOut.
 Also, there is a holdState element in the RSL created by Condor-G which
 is set to StageIn. I don't know yet what this is for.


 Fewer jobs appeared in the container log than were submitted by Condor:
 ##########################################################################

 During the last two tests 3500 jobs were submitted by Condor-G but only
 3499 jobs appeared in the logfile.

 I assume the reason for this lies in Condor:
 As can be seen from the RSL snippet above, the Globus job ID is already
 part of the RSL. AFAIK this can only be the case if the job ID itself
 is created before submission to the container and is added to the
 CreateManagedJobInputType submitted to the createManagedJob method of
 the MEJFS.
 In the client-side Condor logfile of the 3500 jobs I saw that
 2 Condor jobs were mapped to the same globus job ID
 (d16256f0-0cd2-11db-a8a8-b3868d4ecca5):

017 (172.316.000) 07/06 08:36:55 Job submitted to Globus
    RM-Contact: osg-test1.unl.edu:9443
    JM-Contact: https://172.16.149.235:9443/wsrf/services/\
         ManagedExecutableJobService?d16256f0-0cd2-11db-a8a8-b3868d4ecca5
    Can-Restart-JM: 0
...

017 (172.325.000) 07/06 08:40:10 Job submitted to Globus
    RM-Contact: osg-test1.unl.edu:9443
    JM-Contact: https://172.16.149.235:9443/wsrf/services/\
          ManagedExecutableJobService?d16256f0-0cd2-11db-a8a8-b3868d4ecca5
    Can-Restart-JM: 0
...

 A job request with an already existing globus job ID will not result in the
 creation of a new job in the container. So in this case, one job request
 will not result in a Globus job.
 This case is not logged by the container, not even in DEBUG mode.

 In the second test that failed this time, two job requests were also
 mapped to the same globus job ID.

 In both tests, the remaining 3499 jobs finished without errors and without
 hanging in some state.


 Conclusion:
 ###########

 No real answers to the problems yet, but some explanations for some of the
 ERROR messages and for jobs not reaching the container.
 I posted some of these things to the Condor mailing list but haven't gotten
 an answer so far.
------- Comment #7 From 2006-07-07 09:51:06 -------
> Also, there is a holdState element in the RSL created by Condor-G which
> is set to StageIn. I don't know so far what this is for.

Condor-G likes to do two-stage commit submissions. I can't remember the
reasoning for it, but putting a hold on fileStageIn means the job will not
start doing anything significant until Condor-G comes back with a release call.
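
For reference, in the job description this shows up as a holdState element
next to the staging directives, roughly like the sketch below (based only on
the element name and value mentioned in this report, not a verbatim excerpt
from a submitted RSL):

  ...
  <fileStageIn>
     ...
  </fileStageIn>
  <holdState>StageIn</holdState>
  ...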
------- Comment #8 From 2006-07-19 12:34:42 -------
Here's what I've noticed in the test last night:

1) The names of the GRAM threads are not unique. That makes it hard to see
what's going on across multiple threads of the same type.

2) At the end of the test, there are a few jobs that are stuck in
ManagedExecutableJobResource.remove(). They have been stuck there for hours.
Whatever the reason they are stuck, they should not loop forever; there should
be some maximum timeout for it. The problem is that when these jobs are
looping like that, they each hold a container thread, and that container
thread is never returned to the pool (which affects the container's ability
to serve other requests).

3) If all container threads are stuck as in 2), and since the container does
not limit the maximum number of connections, a large request queue can
eventually build up. Each request is one file descriptor, and eventually all
available file descriptors might be used up. That could explain all (or most
of) the other errors in the container that we see.

So one thing to try is to increase the number of file descriptors available to
the container and see how things behave.
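
A minimal sketch of that experiment, assuming the container is started from a
shell and the hard limit allows raising the soft limit (the value 4096 is just
an example):

  # raise the per-process file descriptor limit, then restart the container
  ulimit -n 4096
  $GLOBUS_LOCATION/bin/globus-start-container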
------- Comment #9 From 2006-08-07 03:45:27 -------
Some information about the latest tests:

We had a period of about 20 tests with 3500 jobs with problems.
The problems were various in nature but can be divided into
two categories:

1) File-staging errors (harmless but undesirable)

   We finally found out that these errors occur because of
   dependent transfers in the RSL created by Condor-G.
   RFT didn't keep the transfers in strict order, so some files were
   supposed to be put in a directory that didn't exist yet
   (because it was to be created by another transfer).
   On a second or third try, all transfers within a job
   always succeeded.

   Ravi fixed this some days ago, so we don't have to worry
   about this anymore.

2) Mixture of errors that impacted the success of tests:     

   ERROR container.GSIServiceThread [ServiceThread-N,process:145]
   Error processing request
   java.net.SocketException: Connection reset
   ...

   ERROR exec.StateMachine [RunQueue Other,processCompletedStagingJob:2729]
   Unable to destroy transfer.
   AxisFault
   ... faultString: Proxy file (/tmp/x509up_u2) not found.
   ...

   ERROR securemsg.X509SignHandler [RunQueue Other,handleMessage:109]
   Failed to sign message
   org.globus.gsi.GlobusCredentialException:
   Proxy file (/tmp/x509up_u2) not found.
   ...

   ERROR container.ServiceDispatcher [ServiceDispacherThread-N,run:230]
   Error accepting new connection
   java.net.SocketException: Too many open files
   ...


We switched from the 4.0.2 code to the GRAM 4.0 branch code, and from that
point on those problems disappeared; only a few problems remained:

a) Some jobs share the same UUID (set on the client side by Condor-G
   using axis.jar).
b) Sometimes jobs don't seem to get released from stageInHold by Condor-G
   (around 1 job per 7000 jobs). At least, sometimes a job doesn't continue
   from stageInHold.


Some days ago Jaime Frey provided a new gridmanager which does not add
a holdState element to the RSL. The last tests were done with this
gridmanager almost without failures: network problems occurred during one
test, when the gridftp-server on osg-test2 wasn't reachable for some time.
We expected the test runs to perform better because of the missing
holdState, but there is no significant improvement.
------- Comment #10 From 2006-08-18 07:59:22 -------
The latest tests ran stably. Altogether 30 tests succeeded with no errors or
with minor errors that are not caused by the GT container (see below).
An average test takes about 11 hours to finish for 3500 jobs, each with a
runtime of a few seconds and with 2MB of input data and 2MB of output data.

Relevant information for running jobs reliably:

1) Start the container process with enough memory:
   512 MB is not enough (there will be OutOfMemory exceptions); 1024 MB is ok.
   This can be done by setting $GLOBUS_OPTIONS="-Xmx1024M" (see the sketch
   after this list).
   Due to the long duration of a test run (about 11 hours for 3500 jobs)
   the exact limit was not tested.

2) Use the GT 4.0 branch code (WS-GRAM and RFT).

3) If jobs are submitted through Condor-G, make sure to create a proxy
   certificate that is valid for just 12 hours, even if the test takes more
   time (see below and the sketch after this list).
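
A sketch of the corresponding commands (the heap size and proxy lifetime are
the values discussed above; everything else is illustrative):

   # on the service host: give the container 1 GB of heap, then start it
   export GLOBUS_OPTIONS="-Xmx1024M"
   $GLOBUS_LOCATION/bin/globus-start-container

   # on the client host: create a proxy that is valid for 12 hours
   grid-proxy-init -valid 12:00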


Still existing problems

1) Quite rare: two jobs share the same UUID. The UUID is generated on the
   client side by Condor-G, which uses Axis to generate it.
   Somehow the algorithm used does not always generate unique IDs.
   This results in one of the jobs not being accepted by the container if
   those jobs overlap in time.

2) Rare: At start of the Condor gridmanager daemon, an empty directory is
   created on the client side in /tmp/condor_g_scratch.0x*.*, which is used
   to create directories for jobs on the server via RFT.
   Sometimes there are errors which indicate that this directory has been
   removed on the client side.
   We're sure that it's not a WS-GRAM/RFT error because this directory exists
   only on the client side, but we cannot figure out exactly why this
   directory is removed.

3) We figured out during performance tests that tests with a duration longer
   than 12 hours cause problems with credentials, and that Condor causes
   these problems.
   Jaime will fix this.
   A workaround in the meantime is to generate a proxy with a validity of 12
   hours, even if the test takes longer. In that case Condor will refresh
   the proxy before it expires.

4) Very rare: a few jobs seem not to be released from the holdState by Condor.
   Jaime provided a gridmanager that does not add a holdState to RSLs, and
   it works well.
------- Comment #11 From 2006-11-13 11:26:33 -------
Setting milestone 4.0.3.  This was included in that release.