accumulo-notifications mailing list archives

From "Adam J Shook (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-4787) Numerous leaked CLOSE_WAIT threads from TabletServer
Date Fri, 26 Jan 2018 17:08:00 GMT


Adam J Shook commented on ACCUMULO-4787:

From the users list:

I checked all tablet servers across all six of our environments and it
seems to be present in all of them, with some having upwards of 73k connections.

I disabled replication in our dev cluster and restarted the tablet
servers.  Left it running overnight and checked the connections -- a
reasonable number in the single or double digits.  Enabling replication
led to a quick climb in the CLOSE_WAIT connections to a couple thousand,
leading me to think it is some lingering connection reading a WAL file from

I've opened ACCUMULO-4787
<> to track this and we
can move discussion over there.
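For anyone reproducing the check described above, a minimal sketch of counting a tserver's TCP connections by state might look like this (the PID value is a placeholder; assumes `lsof` and standard coreutils are available):

```shell
# Count the tserver's TCP connections grouped by state.
# TSERVER_PID is a placeholder -- find the real one with `jps -ml`.
TSERVER_PID=12345
lsof -n -P -a -p "$TSERVER_PID" -i TCP \
  | awk 'NR > 1 { print $NF }' \
  | sort | uniq -c | sort -rn
```

A healthy tserver should show the state counts Adam describes (single or double digits); a climbing `(CLOSE_WAIT)` count reproduces the symptom.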


On Thu, Jan 25, 2018 at 12:23 PM, Christopher <> wrote:

> Interesting. It's possible we're mishandling an IOException from DFSClient
> or something... but it's also possible there's a bug in DFSClient
> somewhere. I found a few similar issues from the past... some might still
> not be fully resolved:
> The HBASE issue is interesting, because it indicates a new HDFS feature in
> 2.6.4 to clear readahead buffers/sockets (
> jira/browse/HDFS-7694). That might be a feature we're not yet utilizing,
> but it would only work on a newer version of HDFS.
> I would probably also try to grab some jstacks of the tserver, to try to
> figure out what HDFS client code paths are being taken to see where the
> leak might be occurring. Also, if you have any debug logs for the tserver,
> that might help. There might be some DEBUG or WARN items that indicate
> retries or other failures that are occurring, but are perhaps handled
> improperly.
> It's probably less likely, but it could also be a Java or Linux issue. I
> wouldn't even know where to begin debugging at that level, though, other
> than to check for OS updates.  What JVM are you running?
> It's possible it's not a leak... and these are just getting cleaned up too
> slowly. That might be something that can be tuned with sysctl.
> On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook <>
> wrote:
>> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
>> 1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
>> the connections.  Just now there are ~25k connections for this one tserver,
>> of which 99.9% of them are all writing to various DataNodes on port 50010.
>> It's split about 50/50 between connections that are CLOSE_WAIT and ones that
>> are ESTABLISHED.  No special RPC configuration.
>> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <> wrote:
>>> +1 to looking at the remote end of the socket to see where they're
>>> going/coming to/from. I've seen a few HDFS JIRA issues filed about sockets
>>> left in CLOSE_WAIT.
>>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>> (
>>> of-the-tcp-specification/ covers some of the technical details)
>>> On 1/24/18 6:37 PM, Christopher wrote:
>>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>>> Accumulo version you're running. I'm assuming you verified that it was the
>>>> TabletServer process holding these TCP sockets open using `netstat -p` and
>>>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>>>> confirm based on the port number that these were Thrift connections or
>>>> could they be ZooKeeper or Hadoop connections? Do you have any special
>>>> non-default Accumulo RPC configuration (SSL or SASL)?
>>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <
>>>> <>> wrote:
>>>>     Hello all,
>>>>     Has anyone come across an issue with a TabletServer occupying a
>>>>     large number of ports in a CLOSE_WAIT state?  The 'normal' number of
>>>>     used ports on a 12-node cluster is around 12,000 to 20,000.
>>>>     In one instance, there were over 68k, and it was preventing other
>>>>     applications from getting a free port, so they would fail to start
>>>>     (which is how we found this in the first place).
>>>>     Thank you,
>>>>     --Adam
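
The jstack-based triage Christopher suggests in the thread above could be sketched roughly as follows (the `jps` match pattern, dump paths, and grep target are illustrative assumptions, not commands from the thread):

```shell
# Find the tserver PID and take a few spaced-out thread dumps.
# The /tserver/ pattern and /tmp paths are placeholders.
TSERVER_PID=$(jps -ml | awk '/tserver/ { print $1 }' | head -1)
for i in 1 2 3; do
  jstack "$TSERVER_PID" > "/tmp/tserver-jstack-$i.txt"
  sleep 10
done
# Recurring frames in org.apache.hadoop.hdfs.* hint at which HDFS client
# code path is holding sockets open.
grep -h 'org\.apache\.hadoop\.hdfs' /tmp/tserver-jstack-*.txt \
  | sort | uniq -c | sort -rn | head
```

Comparing dumps taken minutes apart makes it easier to spot threads stuck in the same DFSClient frame, as opposed to transient reads.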

> Numerous leaked CLOSE_WAIT threads from TabletServer
> ----------------------------------------------------
>                 Key: ACCUMULO-4787
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.8.1
>         Environment: * Ubuntu 14.04
> * HDFS 2.6.0 and 2.5.0 (in the middle of an upgrade cycle)
> * ZooKeeper 3.4.6
> * Accumulo 1.8.1
> * HotSpot 1.8.0_121
>            Reporter: Adam J Shook
>            Assignee: Adam J Shook
>            Priority: Major
> I'm running into an issue across all environments where TabletServers are occupying a
> large number of ports in a CLOSE_WAIT state writing to a DataNode at port 50010.  I'm seeing
> numbers from around 12,000 to 20,000 ports.  In some instances, there were over 68k, and
> it was preventing other applications from getting a free port, so they would fail to start
> (which is how we found this in the first place).
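
As a rough sketch, the symptom in this description could be quantified over time with a one-liner like the following (assumes the net-tools `netstat` and the default DataNode transfer port 50010; `ss -tan` works similarly on newer systems):

```shell
# Count CLOSE_WAIT sockets whose remote end is a DataNode on port 50010.
# In `netstat -tan` output, $6 is the state and $5 the foreign address.
netstat -tan | awk '$6 == "CLOSE_WAIT" && $5 ~ /:50010$/' | wc -l
```

Running this periodically (e.g. under `watch`) before and after enabling replication would confirm the correlation described above.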

This message was sent by Atlassian JIRA
