Bug 4197

Summary: WS GRAM integration in OSG 0.4.x and 0.6.x
Product: GRAM Reporter: Stuart Martin <smartin@mcs.anl.gov>
Component: Campaign Assignee: Peter Lane <lane@mcs.anl.gov>
Status: RESOLVED FIXED    
Severity: major CC: bester@mcs.anl.gov, childers@mcs.anl.gov, gawor@mcs.anl.gov, jfrey@cs.wisc.edu, lane@mcs.anl.gov, madduri@mcs.anl.gov, ranantha@mcs.anl.gov
Priority: P3    
Version: 4.0.1   
Target Milestone: 4.0.2   
Hardware: Macintosh   
OS: All   
Bug Depends on: 2397, 3685, 4192, 4329, 4330, 4331    
Bug Blocks: 4050    

Description From 2006-02-03 16:29:21
Title:

WS GRAM integration in OSG 0.4.x and 0.6.x


Definition:

GT 4.0.1 WS GRAM is in VDT and subsequently in OSG 0.4.0.  The WS GRAM service is not deployed and 
available by default in OSG 0.4.0; it is available as an optional deployment.  In order to test WS GRAM 
to OSG's satisfaction, Frank Wuerthwein has tasked Brian Bockelman (student, U of Nebraska) to 
install OSG 0.4.0 and deploy WS GRAM for testing and evaluation.  The plan is to test Condor-G submitting 
a real OSG application with a real workload to the deployed WS GRAM service, and possibly to try out alternate 
service configurations (e.g. GridFTP running on a separate service host).  After the testing and evaluation 
process has completed successfully, other OSG sites might be asked to deploy WS GRAM too.  After 
successful use from these deployments, WS GRAM can then be considered a required OSG service 
(meaning it will be deployed by default).  OSG will continue to deploy pre-WS GRAM as well.

VDT OSG target release dates
================

OSG 0.6
----------
July 15 is the target date for the OSG 0.6 release
June 1 is the date when final release testing will begin
May 1 is when the set of required services are decided and ITB testing begins

Upshot: WS GRAM needs to be ready to go by May 1 to make 0.6.0.

OSG 0.4.1
------------
April 1-15: Release of OSG 0.4.1 (based on VDT 1.3.10)
March 15: Release of VDT 1.3.10

Upshot: WS GRAM needs to be ready to go by Feb 15th to make 0.4.1 as a required service.

Deliverables:

1) Approved/certified version of WS GRAM for OSG (coming from GT 4.0 community branch)
2) Web page documenting performance results from testing/evaluation

Tasks:

1) Support Nebraska for any installation questions/issues
2) Support Nebraska during the testing and evaluation period
3) Analyze/debug/resolve any issues
4) Make improvements as necessary
------- Comment #1 From 2006-03-10 16:37:48 -------
Here are the things we improved so far in working with large CRAB runs at UNL:

1) Fixed a recovery bug.
2) Updated to the latest WS-GRAM globus_4_0_branch code. This improved error
reporting, making it easier to diagnose problems.
3) Updated to the latest RFT globus_4_0_branch code. This fixed some problems
with out-of-order transfers and hung GridFTP control channels.
4) Improved GRAM job queue utilization (see the sketch after this list). This
allows for faster response times for simple fork jobs. It also improves
performance in general because relatively fast events aren't stuck behind
relatively slow staging events.
5) Implemented local transport to RFT. This improved container responsiveness,
since it essentially removed RFT callouts from the set of outstanding connection
attempts at any given time, leaving Condor-G with more available threads to
submit jobs.
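
As a rough illustration of the queue change in item 4 (this is not the actual
WS GRAM code; the class and method names below are hypothetical), the idea is
to keep quick events such as fork-job state changes on a separate worker pool
from slow staging events, so fast work is never queued behind file transfers:

    // Hypothetical sketch only: separate pools for fast events and slow staging events.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class EventDispatcher {
        private final ExecutorService fastPool = Executors.newFixedThreadPool(4);
        private final ExecutorService stagingPool = Executors.newFixedThreadPool(4);

        // Staging (file transfer) events get their own pool; everything else,
        // e.g. state changes for simple fork jobs, runs on the fast pool.
        void submit(Runnable event, boolean isStaging) {
            (isStaging ? stagingPool : fastPool).submit(event);
        }
    }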

Current status:

CRAB runs up to about 2000 jobs are completing for the most part. Occasionally
a job is left unsubmitted or is held because of an error.

CRAB runs above 2000 jobs aren't faring so well. There's an issue where jobs
seem to be lost in the state machine. I can't find any direct evidence (perhaps
partly because the log files get so big it's hard to find anything out of the
ordinary), but this may be a delegation problem. Close to the end of the run
(i.e. when no more activity is observed in the container), a lot of errors
are generated about delegated credential resources that can't be found. This
is most likely because Condor-G failed to set the lifetime of the delegated
credentials to a period long enough to accommodate the long execution time of
the CRAB run.

Once the Condor guys fix this, we can resume testing to see whether that helps
stability. If it does, we need to figure out a better way to handle the
situation so that it's easier to identify and doesn't simply result in lost
jobs. If not, obviously we'll have to keep looking for the cause.
------- Comment #2 From 2006-03-21 11:31:27 -------
I fixed a bug that seemed to eliminate the problem of jobs seemingly
disappearing from the state machine. In fact, there was evidence of this in the
form of a debug message rather than an error or warning. This is unrelated,
after all, to the expiring delegation issue.

Jaime Frey said that he is testing the latest Condor-G code to make sure it's
doing what it's supposed to in terms of refreshing the delegated credential.
Meanwhile, Brian is installing the latest Condor-G release available, since
fixes that may have an effect on the delegated credential issue are in the
latest release. We should have some results by week's end at the latest.
------- Comment #3 From 2006-03-30 11:35:23 -------
I'm adding bug #3121 as a dependency since a large number of jobs seem to be
failing due to problems connecting to GridFTP. This bug pertains to the broken
check on the maximum number of transfers that are active at any one time.
Hopefully limiting the number of transfers will reduce the traffic between the
two machines involved and prevent connection timeouts.
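
For context, here is a minimal sketch of the kind of throttle such a check is
meant to provide (this is not the RFT implementation; the class and names are
hypothetical):

    // Hypothetical sketch only: bound the number of concurrently active transfers.
    import java.util.concurrent.Semaphore;

    class TransferThrottle {
        private final Semaphore activeSlots;

        TransferThrottle(int maxActiveTransfers) {
            activeSlots = new Semaphore(maxActiveTransfers);
        }

        // Blocks until a slot is free, so no more than maxActiveTransfers
        // GridFTP transfers run at the same time.
        void runTransfer(Runnable transfer) throws InterruptedException {
            activeSlots.acquire();
            try {
                transfer.run();
            } finally {
                activeSlots.release();
            }
        }
    }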
------- Comment #4 From 2006-03-30 16:25:54 -------
There were Condor-G issues in 6.7.14 with refreshing a delegation for
long CRAB runs (> 12 hours, I think).  This was hoped to be fixed with
a new version of Condor-G, 6.7.17, but Brian backed that out as it
caused problems with pre-WS GRAM users on the same machine ([condor-admin
#13508]).  Jaime is helping Brian set up two Condor installs: one for pre-WS
jobs and one for experimental use such as this WS GRAM testing.

At the same time, the CRAB application is no longer stable and cannot
be used for testing at the moment, and, according to Brian, maybe not for a
while.  We are stepping back to a dummy stage-in-sleep-stage-out job
(condor-g-ws-test-sleep-io).  Initial testing with a 4 MB input and
10 MB output file produced RFT/GridFTP errors.  We will reduce the
file sizes, see where the breaking point is, and then report back.  During
this run there were also Condor job execution errors that may have been caused
by NFS.  Brian has changed an NFS setting, hopefully avoiding the problem.
We'll see.

Here are the current action items:

1) Brian: Get Condor-G version 6.7.17 running just for ws-gram jobs
on osg-test2.
2) Peter: Adjust the input file size to 0.5 MB for the condor-g-ws-test-sleep-io test.
3) Peter: Adjust the output file size to 1 MB for the condor-g-ws-test-sleep-io test.
4) Peter: Run a 3500-job test run with condor-g-ws-test-sleep-io.

2 & 3 are necessary because of the RFT bug (#3121) that prevents throttling
the number of active transfers. After getting the above test to  
work, we can worry about getting RFT patched and find a reasonable  
setting to avoid the GridFTP server timeouts.
------- Comment #5 From 2006-03-30 18:04:54 -------
Tasks 1 through 3 of the latest action list are done (thanks Brian!). I'm
running a 500-job test right now just to shake everything down. If that runs
cleanly I'll bump it up to 3500.
------- Comment #6 From 2006-04-28 12:03:38 -------
I haven't had a chance to track down the problems I was having with larger runs.
Here's the email I wrote about this problem on March 31st:

>> I was able to run a 500-job test, but the 3500-job test had problems.
>> It got hung up at one point waiting for the fileCleanUp transfer
>> request before being destroyed. I looked into it, and if I'm not
>> mistaken it looks like RFT lost the request. Here's the line where
>> GRAM registers the transfer request with the RFT notification
>> listener thread:
>>
>> 2006-03-31 12:41:49,653 DEBUG exec.StagingListener [RunQueue
>> FileCleanUp,registerTransferJob:104]
>> [execJobKey:72193bb0-c0b9-11da-9560-fa7a7cd61e10,transferJobKey: 150037]
>> Leaving registerTransferJob()
>>
>> From this I determined that the request ID is "150037"
>> ("transferJobKey" is my name for the RFT request id--don't worry, I
>> checked for sanity's sake that this wasn't a transfer ID too). I then
>> did a search in the MySQL database for this request:
>>
>> mysql> select * from transfer where request_id=150037;
>> Empty set (0.12 sec)
>>
>> mysql> select * from request where id=150037;
>> Empty set (0.00 sec)
>>
>> RFT doesn't appear to delete the database records, so even if the
>> request was destroyed the record should still be there. That said, I
>> also don't see any deliver() calls for that request. Unfortunately I
>> didn't have debugging turned on for RFT, so I'll have to restart the
>> test.
>>
>> Peter
------- Comment #7 From 2006-06-07 11:49:05 -------
Hello

I have talked to Stu Martin, who said that follow-up on this problem since April 28th
- see the bottom of this Bugzilla report - has been superseded by a higher
priority project. Stu also said that he hopes to get some effort available in
the next month or so.

Is this a correct assessment? When I read the last posting on this Bugzilla
thread I thought there would be continuing follow-up, and was worried that OSG
had failed to be available as needed.

Thank you

Ruth
------- Comment #8 From 2006-06-12 10:13:16 -------
I am marking this campaign as closed.  Another campaign, bug 4506, has been
created to focus the continuing effort in this area.

-Stu
------- Comment #9 From 2006-06-15 08:46:29 -------
I'm increasing the priority of the replacement campaign aimed at this issue,
4050.
------- Comment #10 From 2006-06-15 09:17:26 -------
Oops! I mean the priority of campaign 4506 is being increased!