Bug 2521 - CAMPAIGN: Throughput Bottleneck Profiling Round 1
: CAMPAIGN: Throughput Bottleneck Profiling Round 1
Status: RESOLVED FIXED
: GRAM
wsrf managed execution job service
: unspecified
: All All
: P3 normal
: ---
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2005-01-05 16:34 by
Modified: 2005-01-14 12:26 (History)


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2005-01-05 16:34:48
LEADER<bn>
<bn>
Peter Lane<bn>
<bn>
PEOPLE<bn>
<bn>
Peter Lane<bn>
<bn>
TIME ESTIMATE<bn>
<bn>
5 days<bn>
<bn>
DESCRIPTION<bn>
<bn>
As a result of a report by the DiPerf people, there is an indication that a) the
maximum number of simultaneous clients submitting jobs to the GT 3.9.4 GRAM
services for non-staging, fork jobs before failure is about half that of GT2 (69
vs. 115), and b) the maximum number of simultaneous clients in GT 3.9.4 GRAM
services for fork jobs involving staging of output is very poor (8 simultaneous
clients).  An attempt shall be made to profile service timings to find obvious
bottlenecks that are preventing higher simultaneous client numbers, and to find
possible improvements to the code to alleviate the bottlenecks.  This is one
facet of improving general throughput for the GT4 GRAM services.<bn>
<bn>
DELIVERABLES<bn>
<bn>
1) A list of bottleneck points in the service code preventing higher throughput.<bn>
2) Recommended fixes for bottlenecks listed in deliverable #1.<bn>
<bn>
TASKS<bn>
<bn>
1) Collect service timings using output provided by previous instrumentation
campaign while submitting non-staging jobs with the throughput tester.<bn>
2) Modify the throughput tester to be able to submit simple staging (job output)
jobs.<bn>
3) Collect service timings using output provided by previous instrumentation
campaign while submitting staging jobs with the throughput tester.<bn>
4) Analyze service timings to identify bottlenecks and determine possible fixes.<bn>
------- Comment #1 From 2005-01-05 18:28:14 -------
Results from a short, 100-client, non-staging, fork run of the throughput
tester
(sorted from shortest to longest duration):

processNoneState duration average (seconds): 0.006
processStageInState duration average (seconds): 0.006
processFileCleanUpState duration average (seconds): 0.00601111111111111
processStageOutState duration average (seconds): 0.00701111111111111
processDoneState duration average (seconds): 0.0109555555555556
processOpenStdoutState duration average (seconds): 0.011
remove duration average (seconds): 0.0114
processOpenStderrState duration average (seconds): 0.012
processWaitingForStateChangesState duration average (seconds): 0.016
processCleanUpState duration average (seconds): 0.0162777777777778
processStartState duration average (seconds): 0.07
processCacheCleanUpState duration average (seconds): 0.265966666666667
processSubmitState duration average (seconds): 0.564
createManagedJob duration average (seconds): 1.59327777777778

As one can see, the top three time hogs are 1) createManagedJob(), 2) cache
clean up, and 3) submit.

I can come to two basic conclusions from this:
1) createManagedJob is by far the major bottleneck here, and
2) the invocations of perl scripts are slowing things down a little.

I would recommend for starters to push as many sub-tasks in createManagedJob as
possible out of this function (possibly adding one or more internal
initialization states).  This won't decrease overall execution time, but it
should free up container connections more quickly.
------- Comment #2 From 2005-01-06 12:34:51 -------
I bet there is not a lot that can be done in the submit task.  .5 seconds to
call perl, submit job to 
scheduler...

What are the subtasks that go on in the createManagedJob task?  Is that where
the delegated cred is 
getting written?
------- Comment #3 From 2005-01-06 13:10:00 -------
The top level tasks are 1) create the resource, 2) subscribe for notifications
(if requested).  The create() 
call initializes the resource state. This is where all the real crap is done
including creating and writing 
the delegated credential and resolving RSL variables.

I'm wondering if this perl call to get the user's home directory is taking a
lot of time.  It's made about a 
half dozen times. This could be optimized by adding the value to the resource
internal state the first 
time it's determined.  I may hack it to return a hard-coded value just to see
if it improves the creation 
time significantly.
------- Comment #4 From 2005-01-06 13:22:01 -------
Yea - that would be a good test.

Also, we know we want to optimize the delegated file creation to be one file per user/Grid DN, so it 
would be interesting to comment that out and retest to see know how much that costs us.

Or add timings to get a breakdown of all of these tasks in createmanagedjob.
------- Comment #5 From 2005-01-06 14:05:16 -------
I just did the hard-coded user directory hack and collected timings.  Here are
the previous top three 
time hogs:

processSubmitState duration average (seconds): 0.151
processCacheCleanUpState duration average (seconds): 0.0414444444444444
createManagedJob duration average (seconds): 0.355

I reran with the original code to get stats from a similar machine state since
I was confused why the 
other timings were better as well.  I got the following:

processSubmitState duration average (seconds): 0.145
processCacheCleanUpState duration average (seconds): 0.164577777777778
createManagedJob duration average (seconds): 1.58996666666667

I would say that the first two are questionable as to whether the change
helped, but createMangedJob() 
definitely seems to be benefiting from a simulated caching of th user home
value.

It should also be noted that these tests have been running without delegation. 
I'll run it with delegation 
to get a comparison.
------- Comment #6 From 2005-01-06 14:48:36 -------
Delegating (orifinal code) job timings:

processNoneState duration average (seconds): 0.006
processDoneState duration average (seconds): 0.00648888888888889
processStageInState duration average (seconds): 0.007
processStageOutState duration average (seconds): 0.008
processFileCleanUpState duration average (seconds): 0.008
processOpenStderrState duration average (seconds): 0.015
processOpenStdoutState duration average (seconds): 0.015
processWaitingForStateChangesState duration average (seconds): 0.017
processCleanUpState duration average (seconds): 0.07
processCacheCleanUpState duration average (seconds): 0.189922222222222
remove duration average (seconds): 0.429133333333333
processStartState duration average (seconds): 0.554
processSubmitState duration average (seconds): 0.764
createManagedJob duration average (seconds): 8.6342

Delegating (cached user home simulation code) job timings:

processDoneState duration average (seconds): 0.006
processStageOutState duration average (seconds): 0.007
processFileCleanUpState duration average (seconds): 0.007
processCleanUpState duration average (seconds): 0.012
processOpenStdoutState duration average (seconds): 0.012
processOpenStderrState duration average (seconds): 0.013
processWaitingForStateChangesState duration average (seconds): 0.013
processStartState duration average (seconds): 0.017
processNoneState duration average (seconds): 0.047
processStageInState duration average (seconds): 0.051
processCacheCleanUpState duration average (seconds): 0.136
processSubmitState duration average (seconds): 0.221
remove duration average (seconds): 0.337922222222222
createManagedJob duration average (seconds): 8.37098888888889

My guess is that this createManagedJob timming with delegation is the main problem the DiPerf 
streaming job test results were so bad.  Credential reuse would definitely help with throughput where 
delegation is needed.
------- Comment #7 From 2005-01-07 12:15:43 -------
It looks like the delegated credential handling time is an indirect factor of
the poor throughput.  In jobs 
that took over 13 seconds to create, the security initialization was still
under 2 seconds.  It seems to be 
the poor placement of a synchronized block that is primarely hindering
throughput.  The code inside of 
the synchronized block takes under 2 seconds, but timings taken outside of the
synchronized block 
takes close to the full 13 seconds.  This suggests that because so many jobs
are piled up waiting for 
the lock, the least increase in time inside the block ripples up the waiting
stack like a traffic jam caused 
by a "gawkers' road block".

At any rate, after a simple hack (that I may just commit) to test the
hypothesis, the bottlneck shifted to 
registering for credential refreshes.  The throughput also seemed to increase a
bit, but is still being 
held back by the new bottleneck.
------- Comment #8 From 2005-01-07 18:08:08 -------
Here is a formal list of bottlenecks and recommended fixes, completeing the
deliverables for this campaign.


Non-staging job bottlenecks
-----------------------------------------------------------------
1)	Traffic jam caused by the poor utilization of the synchronized block in 
	ManagedExecutableJobHome.
2)	DelegationUtil.registerDelegationListener()
3)	Repeated perl script calls for determining the user's home directory.

Recommended non-staging job bottleneck fixes
-----------------------------------------------------------------
1a)	Shrink the synchronized block by moving the "this.add(key, resource)" up to 
	right after the resource is actually created (and before initialization), and 
	ending the block after that.
2a)	investigate the Delegation Service to see if any bottlenecks can be identified 
	and alleviated to improve registration throughput.
2b)	Implement credential reuse for jobs in the same container and of the same 
	user.
3a)	Cache the user's home directory in the resource's internal state data.

Staging job bottlenecks
-----------------------------------------------------------------
- No significant additional bottlenecks were identified.

Recommended staging job bottleneck fixes
-----------------------------------------------------------------
- No significant additional bottlenecks were identified.
------- Comment #9 From 2005-01-07 19:57:39 -------
About 1a): I am trying to recollect why we put this synchronized block in the
first place,  
since find() and add() are thread-safe. I can't see any reason for atomicity of
this  
critical section... Would be nice if we could keep the code as is, but simply
remove the  
synch() directive. 

Also, same comment for ManagedMultiJobHome's synchronized block in create().  
We might wanna profile that too. 
------- Comment #10 From 2005-01-08 10:25:46 -------
Even if find() and add() are thread safe, a lack of a synchronized block would
allow for two or more 
instances to be created and initialized for the same key.  Presuming a
hashtable is used underneath, 
only one would end up being mapped to the key in the end.  This would waste
time pointlessly, but if 
it's harmeless stability wise and is so rare that the performance hit isn't a
real threat, then it may be a 
moot point.  There is some merrit to having a fully initialized resource before
it's added to the 
hashtable, though.  I guess it's a trade off between possible performance hits
and having the possibility 
of getting an EPR to an empty or partially initialized resource.  Looking at it
from that perspective, I'm 
inclined to agree with Alain.  I'll post an updated set of deliverables with
the change.

And indeed, I was planning on looking at Multi to make sure the same
bottlenecks didn't plague it as 
well when it comes time to actually do the work to relieve them.
------- Comment #11 From 2005-01-08 10:28:32 -------
Updated deliverables (specifically, recommended fix 1a)...

Non-staging job bottlenecks
-----------------------------------------------------------------
1)      Traffic jam caused by the poor utilization of the synchronized block in 
        ManagedExecutableJobHome.
2)      DelegationUtil.registerDelegationListener()
3)      Repeated perl script calls for determining the user's home directory.

Recommended non-staging job bottleneck fixes
-----------------------------------------------------------------
1a)     Remove the synchronized() block since the possible consequences of that appear to be small
        compared to the bottleneck.
2a)     investigate the Delegation Service to see if any bottlenecks can be identified 
        and alleviated to improve registration throughput.
2b)     Implement credential reuse for jobs in the same container and of the same 
        user.
3a)     Cache the user's home directory in the resource's internal state data.

Staging job bottlenecks
-----------------------------------------------------------------
- No significant additional bottlenecks were identified.

Recommended staging job bottleneck fixes
-----------------------------------------------------------------
- No significant additional bottlenecks were identified.
------- Comment #12 From 2005-01-12 17:22:03 -------
Bottleneck #2 was found (via recommendation 2a) to actually be caused by the
fact that the 
registerDelegationListener() code ends up calling the listener's
setCredential() method in the same 
thread.  since user proxy generation occured in setCredential(), this was
causing the throughput to 
appear in the DS.  Since in general this causes a denial of service situation
in the DS, this issue has been 
logged as a bug and fixed.  The performance is still the same, but the
bottleneck has been more 
accurately associated with user proxy key pair generation.  At this point,
recommendation 2b still 
makes sense to avoid unneccesary key pair generation.