accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Large number of used ports from tserver
Date Thu, 25 Jan 2018 17:23:10 GMT
Interesting. It's possible we're mishandling an IOException from DFSClient
or something... but it's also possible there's a bug in DFSClient
somewhere. I found a few similar issues from the past... some might still
not be fully resolved:

https://issues.apache.org/jira/browse/HDFS-1836
https://issues.apache.org/jira/browse/HDFS-2028
https://issues.apache.org/jira/browse/HDFS-6973
https://issues.apache.org/jira/browse/HBASE-9393

The HBASE issue is interesting, because it indicates a new HDFS feature in
2.6.4 to clear readahead buffers/sockets (
https://issues.apache.org/jira/browse/HDFS-7694). That might be a feature
we're not yet utilizing, but it would only work on a newer version of HDFS.

I would probably also try to grab some jstacks of the tserver, to try to
figure out what HDFS client code paths are being taken to see where the
leak might be occurring. Also, if you have any debug logs for the tserver,
that might help. There might be some DEBUG or WARN items that indicate
retries or other failures that are occurring, but perhaps handled
improperly.

It's probably less likely, but it could also be a Java or Linux issue. I
wouldn't even know where to begin debugging at that level, though, other
than to check for OS updates.  What JVM are you running?

It's possible it's not a leak... and these are just getting cleaned up too
slowly. That might be something that can be tuned with sysctl.
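
To see which remote ports and TCP states dominate, the `lsof -i` output
described in the quoted message below can be tallied per PID. This is only a
minimal sketch: the function name `tally` is hypothetical, and it assumes the
output was captured with `lsof -i -n -P` so that hosts and ports are numeric.
Note that lsof prints the state as CLOSE_WAIT.

```python
import re
from collections import Counter

# Matches the tail of an `lsof -i -n -P` line for a connected socket, e.g.
#   TCP 10.0.0.5:45678->10.0.0.9:50010 (CLOSE_WAIT)
# Listening sockets have no "->" and are intentionally skipped.
_CONN = re.compile(r"->\S+:(\d+)\s+\((\w+)\)\s*$")


def tally(lines, pid):
    """Count connections owned by `pid`, keyed by (remote port, TCP state).

    `lines` is any iterable of lsof output lines (hypothetical helper,
    not part of Accumulo or Hadoop).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # The second column of lsof output is the PID; this also skips
        # the header row and blank lines.
        if len(fields) < 2 or fields[1] != str(pid):
            continue
        match = _CONN.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts
```

For example, `tally(open("lsof.txt"), 12345).most_common(10)` would list the
ten most frequent (port, state) pairs from a saved capture; the file name and
PID here are placeholders.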

On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook <adamjshook@gmail.com> wrote:

> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
> 1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
> the connections.  Just now there are ~25k connections for this one tserver,
> of which 99.9% are writing to various DataNodes on port 50010.
> It's split about 50/50 for connections that are CLOSED_WAIT and ones that
> are ESTABLISHED.  No special RPC configuration.
>
> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> +1 to looking at the remote end of the socket and see where they're
>> going/coming to/from. I've seen a few HDFS JIRA issues filed about sockets
>> left in CLOSED_WAIT.
>>
>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>
>> (
>> https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
>> covers some of the technical details)
>>
>> On 1/24/18 6:37 PM, Christopher wrote:
>>
>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>> Accumulo version you're running. I'm assuming you verified that it was the
>>> TabletServer process holding these TCP sockets open using `netstat -p` and
>>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>>> confirm based on the port number that these were Thrift connections or
>>> could they be ZooKeeper or Hadoop connections? Do you have any special
>>> non-default Accumulo RPC configuration (SSL or SASL)?
>>>
>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjshook@gmail.com
>>> <mailto:adamjshook@gmail.com>> wrote:
>>>
>>>     Hello all,
>>>
>>>     Has anyone come across an issue with a TabletServer occupying a
>>>     large number of ports in a CLOSED_WAIT state?  'Normal' number of
>>>     used ports on a 12-node cluster are around 12,000 to 20,000 ports.
>>>     In one instance, there were over 68k and it was affecting other
>>>     applications from getting a free port and they would fail to start
>>>     (which is how we found this in the first place).
>>>
>>>     Thank you,
>>>     --Adam
>>>
>>>
>
