Bugzilla – Bug 1546
File descriptor leak & potential deadlock
Last modified: 2008-07-18 14:35:07
You need to
before you can comment on or make changes to this bug.
This is a patch for two GRAM problems:
1) There is a file descriptor leak.
2) There is a fix for a potential deadlock problem.
These were developed by David Smith at CERN and tested as part of the VDT.
I will attach the patch separately.
Created an attachment (id=310) [details]
Patch for problem described above
Patch for problem described above
The first problem (FD leak) ought to have been fixed in Globus 2.4. There was a
problem in Globus I/O where failed connects would result in a FD leak, but that
was fixed a long time ago (see bug #237). Can you verify that this problem
exists with a recent version of the code (and identify version what you are
using). A test case sent in with the patches would be excellent.
The deadlock problem has been fixed already. See bug #1566 for details on bug
and the fix I applied.
the first problem obviously was *not* fixed in 2.4!
Look at the original globus_2_4_3 code in the attachment!
We found the leak running the Condor-G "gahp_server" under "valgrind",
so this is not some theoretical issue: it actually was a problem!
Please do apply the fix; it has been carefully tested. Thanks.
Regarding the first problem addressed by this patch (file descriptor leak):
The version being discussed is globus-2.4.3, it's identified in the patch text,
although not directly in the bug comment.
The patch is forcing a close on the connection handle from a failed connect.
(The close was previously conditional on whether it was a failed connect or
whether a subsequent write register failed). Infact there is still some room for
tidying up the handling of the error objects here, although as written it
doesn't cause any problem. (As an asside there are several places in the
existing code where error objects are never released, although given the small
finite size of the globus object cache it don't see that it causes any problems)
In this case I think there is argument for saying that, given the connection
handle should always be closed before globus_l_gram_protocol_connect_callback()
is called with an error, the close is indeed not necessary. However we've seen
that sometimes the failed connection handle is _not_ closed and thus we suffer
from the file descriptor leak. So in the end I made a decision to always attempt
a close when there is an error in this routine, rather than distinguish whether
the error was a connection error or some other error in the callback processing.
If you prefer to enforce the close in the connection logic that would also be
A test case for this could be attempting a connection (using a secure
authentication mode) to a destination that accepts the connection, receives the
token but then drops the connection. I had started to look into the lower level
connection handling, with the intention of including the required patch in some
future suggestions. Indeed I have one solution now, although I haven't convinced
myself that it really cannot cause any existing code to misbehave. I include it
now so that you can follow that up, if you like.
(see attached patch to globus_io_securesocket.c)
Created an attachment (id=328) [details]
Some suggested changes for globus_io_securesocket.c
Some possible changes to ensure that failed connection attempts close the
connection handle before calling the connection callback routine.
I'm redirecting this bug to the I/O guys. You should definitely try without
this patch with the 3.2 beta---Globus I/O has been reworked as a compatibility
layer on top of XIO, so the bugs might be a bit different.
The semantics of not requiring a close after a failed connect will remain. The
fd leak appears to be possible when authentication fails mid way through the
Since 3.2 now uses a globus io compatibility wrapper over globus xio, the
priority for fixing this will be pretty low. As a work around, it does not
hurt to always call globus_io_register_close(). If the handle has already been
destroyed, the call will just fail.
As for error objects not being released, this is the way they're intended to be
used. As you note, the error object cache size is finite, so it's not a real
leak. The system was designed in such a way to allow you to ignore return
codes. If you're not going to do anything with an error object, you have no
reason to clutter your code with get-ing and free-ing the object.