Bugzilla – Bug 2521
CAMPAIGN: Throughput Bottleneck Profiling Round 1
Last modified: 2005-01-14 12:26:13
You need to log in before you can comment on or make changes to this bug.
LEADER<bn> <bn> Peter Lane<bn> <bn> PEOPLE<bn> <bn> Peter Lane<bn> <bn> TIME ESTIMATE<bn> <bn> 5 days<bn> <bn> DESCRIPTION<bn> <bn> As a result of a report by the DiPerf people, there is an indication that a) the maximum number of simultaneous clients submitting jobs to the GT 3.9.4 GRAM services for non-staging, fork jobs before failure is about half that of GT2 (69 vs. 115), and b) the maximum number of simultaneous clients in GT 3.9.4 GRAM services for fork jobs involving staging of output is very poor (8 simultaneous clients). An attempt shall be made to profile service timings to find obvious bottlenecks that are preventing higher simultaneous client numbers, and to find possible improvements to the code to alleviate the bottlenecks. This is one facet of improving general throughput for the GT4 GRAM services.<bn> <bn> DELIVERABLES<bn> <bn> 1) A list of bottleneck points in the service code preventing higher throughput.<bn> 2) Recommended fixes for bottlenecks listed in deliverable #1.<bn> <bn> TASKS<bn> <bn> 1) Collect service timings using output provided by previous instrumentation campaign while submitting non-staging jobs with the throughput tester.<bn> 2) Modify the throughput tester to be able to submit simple staging (job output) jobs.<bn> 3) Collect service timings using output provided by previous instrumentation campaign while submitting staging jobs with the throughput tester.<bn> 4) Analyze service timings to identify bottlenecks and determine possible fixes.<bn>
Results from a short, 100-client, non-staging, fork run of the throughput tester (sorted from shortest to longest duration): processNoneState duration average (seconds): 0.006 processStageInState duration average (seconds): 0.006 processFileCleanUpState duration average (seconds): 0.00601111111111111 processStageOutState duration average (seconds): 0.00701111111111111 processDoneState duration average (seconds): 0.0109555555555556 processOpenStdoutState duration average (seconds): 0.011 remove duration average (seconds): 0.0114 processOpenStderrState duration average (seconds): 0.012 processWaitingForStateChangesState duration average (seconds): 0.016 processCleanUpState duration average (seconds): 0.0162777777777778 processStartState duration average (seconds): 0.07 processCacheCleanUpState duration average (seconds): 0.265966666666667 processSubmitState duration average (seconds): 0.564 createManagedJob duration average (seconds): 1.59327777777778 As one can see, the top three time hogs are 1) createManagedJob(), 2) cache clean up, and 3) submit. I can come to two basic conclusions from this: 1) createManagedJob is by far the major bottleneck here, and 2) the invocations of perl scripts are slowing things down a little. I would recommend for starters to push as many sub-tasks in createManagedJob as possible out of this function (possibly adding one or more internal initialization states). This won't decrease overall execution time, but it should free up container connections more quickly.
I bet there is not a lot that can be done in the submit task. .5 seconds to call perl, submit job to scheduler... What are the subtasks that go on in the createManagedJob task? Is that where the delegated cred is getting written?
The top level tasks are 1) create the resource, 2) subscribe for notifications (if requested). The create() call initializes the resource state. This is where all the real crap is done including creating and writing the delegated credential and resolving RSL variables. I'm wondering if this perl call to get the user's home directory is taking a lot of time. It's made about a half dozen times. This could be optimized by adding the value to the resource internal state the first time it's determined. I may hack it to return a hard-coded value just to see if it improves the creation time significantly.
Yea - that would be a good test. Also, we know we want to optimize the delegated file creation to be one file per user/Grid DN, so it would be interesting to comment that out and retest to see know how much that costs us. Or add timings to get a breakdown of all of these tasks in createmanagedjob.
I just did the hard-coded user directory hack and collected timings. Here are the previous top three time hogs: processSubmitState duration average (seconds): 0.151 processCacheCleanUpState duration average (seconds): 0.0414444444444444 createManagedJob duration average (seconds): 0.355 I reran with the original code to get stats from a similar machine state since I was confused why the other timings were better as well. I got the following: processSubmitState duration average (seconds): 0.145 processCacheCleanUpState duration average (seconds): 0.164577777777778 createManagedJob duration average (seconds): 1.58996666666667 I would say that the first two are questionable as to whether the change helped, but createMangedJob() definitely seems to be benefiting from a simulated caching of th user home value. It should also be noted that these tests have been running without delegation. I'll run it with delegation to get a comparison.
Delegating (orifinal code) job timings: processNoneState duration average (seconds): 0.006 processDoneState duration average (seconds): 0.00648888888888889 processStageInState duration average (seconds): 0.007 processStageOutState duration average (seconds): 0.008 processFileCleanUpState duration average (seconds): 0.008 processOpenStderrState duration average (seconds): 0.015 processOpenStdoutState duration average (seconds): 0.015 processWaitingForStateChangesState duration average (seconds): 0.017 processCleanUpState duration average (seconds): 0.07 processCacheCleanUpState duration average (seconds): 0.189922222222222 remove duration average (seconds): 0.429133333333333 processStartState duration average (seconds): 0.554 processSubmitState duration average (seconds): 0.764 createManagedJob duration average (seconds): 8.6342 Delegating (cached user home simulation code) job timings: processDoneState duration average (seconds): 0.006 processStageOutState duration average (seconds): 0.007 processFileCleanUpState duration average (seconds): 0.007 processCleanUpState duration average (seconds): 0.012 processOpenStdoutState duration average (seconds): 0.012 processOpenStderrState duration average (seconds): 0.013 processWaitingForStateChangesState duration average (seconds): 0.013 processStartState duration average (seconds): 0.017 processNoneState duration average (seconds): 0.047 processStageInState duration average (seconds): 0.051 processCacheCleanUpState duration average (seconds): 0.136 processSubmitState duration average (seconds): 0.221 remove duration average (seconds): 0.337922222222222 createManagedJob duration average (seconds): 8.37098888888889 My guess is that this createManagedJob timming with delegation is the main problem the DiPerf streaming job test results were so bad. Credential reuse would definitely help with throughput where delegation is needed.
It looks like the delegated credential handling time is an indirect factor of the poor throughput. In jobs that took over 13 seconds to create, the security initialization was still under 2 seconds. It seems to be the poor placement of a synchronized block that is primarely hindering throughput. The code inside of the synchronized block takes under 2 seconds, but timings taken outside of the synchronized block takes close to the full 13 seconds. This suggests that because so many jobs are piled up waiting for the lock, the least increase in time inside the block ripples up the waiting stack like a traffic jam caused by a "gawkers' road block". At any rate, after a simple hack (that I may just commit) to test the hypothesis, the bottlneck shifted to registering for credential refreshes. The throughput also seemed to increase a bit, but is still being held back by the new bottleneck.
Here is a formal list of bottlenecks and recommended fixes, completeing the deliverables for this campaign. Non-staging job bottlenecks ----------------------------------------------------------------- 1) Traffic jam caused by the poor utilization of the synchronized block in ManagedExecutableJobHome. 2) DelegationUtil.registerDelegationListener() 3) Repeated perl script calls for determining the user's home directory. Recommended non-staging job bottleneck fixes ----------------------------------------------------------------- 1a) Shrink the synchronized block by moving the "this.add(key, resource)" up to right after the resource is actually created (and before initialization), and ending the block after that. 2a) investigate the Delegation Service to see if any bottlenecks can be identified and alleviated to improve registration throughput. 2b) Implement credential reuse for jobs in the same container and of the same user. 3a) Cache the user's home directory in the resource's internal state data. Staging job bottlenecks ----------------------------------------------------------------- - No significant additional bottlenecks were identified. Recommended staging job bottleneck fixes ----------------------------------------------------------------- - No significant additional bottlenecks were identified.
About 1a): I am trying to recollect why we put this synchronized block in the first place, since find() and add() are thread-safe. I can't see any reason for atomicity of this critical section... Would be nice if we could keep the code as is, but simply remove the synch() directive. Also, same comment for ManagedMultiJobHome's synchronized block in create(). We might wanna profile that too.
Even if find() and add() are thread safe, a lack of a synchronized block would allow for two or more instances to be created and initialized for the same key. Presuming a hashtable is used underneath, only one would end up being mapped to the key in the end. This would waste time pointlessly, but if it's harmeless stability wise and is so rare that the performance hit isn't a real threat, then it may be a moot point. There is some merrit to having a fully initialized resource before it's added to the hashtable, though. I guess it's a trade off between possible performance hits and having the possibility of getting an EPR to an empty or partially initialized resource. Looking at it from that perspective, I'm inclined to agree with Alain. I'll post an updated set of deliverables with the change. And indeed, I was planning on looking at Multi to make sure the same bottlenecks didn't plague it as well when it comes time to actually do the work to relieve them.
Updated deliverables (specifically, recommended fix 1a)... Non-staging job bottlenecks ----------------------------------------------------------------- 1) Traffic jam caused by the poor utilization of the synchronized block in ManagedExecutableJobHome. 2) DelegationUtil.registerDelegationListener() 3) Repeated perl script calls for determining the user's home directory. Recommended non-staging job bottleneck fixes ----------------------------------------------------------------- 1a) Remove the synchronized() block since the possible consequences of that appear to be small compared to the bottleneck. 2a) investigate the Delegation Service to see if any bottlenecks can be identified and alleviated to improve registration throughput. 2b) Implement credential reuse for jobs in the same container and of the same user. 3a) Cache the user's home directory in the resource's internal state data. Staging job bottlenecks ----------------------------------------------------------------- - No significant additional bottlenecks were identified. Recommended staging job bottleneck fixes ----------------------------------------------------------------- - No significant additional bottlenecks were identified.
Bottleneck #2 was found (via recommendation 2a) to actually be caused by the fact that the registerDelegationListener() code ends up calling the listener's setCredential() method in the same thread. since user proxy generation occured in setCredential(), this was causing the throughput to appear in the DS. Since in general this causes a denial of service situation in the DS, this issue has been logged as a bug and fixed. The performance is still the same, but the bottleneck has been more accurately associated with user proxy key pair generation. At this point, recommendation 2b still makes sense to avoid unneccesary key pair generation.