accumulo-user mailing list archives

From "Adam J. Shook" <adamjsh...@gmail.com>
Subject Re: Large number of used ports from tserver
Date Fri, 26 Jan 2018 17:06:29 GMT
I checked all tablet servers across all six of our environments and the
issue seems to be present in all of them, with some having upwards of 73k
connections.

I disabled replication in our dev cluster and restarted the tablet
servers.  Left it running overnight and checked the connections -- a
reasonable number, in the single or double digits.  Re-enabling replication
led to a quick climb in the CLOSE_WAIT connections to a couple thousand,
leading me to think some lingering connections are left reading WAL files
from HDFS.

I've opened ACCUMULO-4787
<https://issues.apache.org/jira/browse/ACCUMULO-4787> to track this and we
can move discussion over there.

--Adam

On Thu, Jan 25, 2018 at 12:23 PM, Christopher <ctubbsii@apache.org> wrote:

> Interesting. It's possible we're mishandling an IOException from DFSClient
> or something... but it's also possible there's a bug in DFSClient
> somewhere. I found a few similar issues from the past... some might still
> not be fully resolved:
>
> https://issues.apache.org/jira/browse/HDFS-1836
> https://issues.apache.org/jira/browse/HDFS-2028
> https://issues.apache.org/jira/browse/HDFS-6973
> https://issues.apache.org/jira/browse/HBASE-9393
>
> The HBASE issue is interesting, because it indicates a new HDFS feature in
> 2.6.4 to clear readahead buffers/sockets
> (https://issues.apache.org/jira/browse/HDFS-7694). That might be a feature
> we're not yet utilizing, but it would only work on a newer version of HDFS.
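>
> If it helps, a quick way to check which Hadoop client the tservers
> actually load (a sketch; `accumulo classpath` output varies by version):
>
>     # HDFS-7694's unbuffer support needs a 2.6.4+ client.
>     hadoop version | head -1
>     accumulo classpath 2>/dev/null | grep -o 'hadoop-hdfs[^/]*\.jar' | sort -u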
>
> I would probably also try to grab some jstacks of the tserver, to try to
> figure out what HDFS client code paths are being taken to see where the
> leak might be occurring. Also, if you have any debug logs for the tserver,
> that might help. There might be some DEBUG or WARN items that indicate
> retries or other failures that are occurring, but perhaps handled
> improperly.
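>
> Something like the following is what I have in mind (a sketch; the PID
> variable is a placeholder, and jstack should run as the tserver's user):
>
>     # Take a few dumps spaced apart to catch the DFSClient code path.
>     for i in 1 2 3; do
>         jstack "$TSERVER_PID" > "tserver-jstack-$i.txt"
>         sleep 15
>     done
>     grep -E -A5 'DFSClient|DataStreamer' tserver-jstack-*.txt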
>
> It's probably less likely, but it could also be a Java or Linux issue. I
> wouldn't even know where to begin debugging at that level, though, other
> than to check for OS updates.  What JVM are you running?
>
> It's possible it's not a leak... and these are just getting cleaned up too
> slowly. That might be something that can be tuned with sysctl.
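>
> For reference, these are the knobs I'd look at first (example values
> only; note that CLOSE_WAIT itself only clears when the application calls
> close(), so kernel timers mostly help the other states):
>
>     sysctl net.ipv4.tcp_fin_timeout net.ipv4.tcp_keepalive_time
>     # e.g. sysctl -w net.ipv4.tcp_fin_timeout=30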
>
> On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook <adamjshook@gmail.com>
> wrote:
>
>> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
>> 1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
>> the connections.  Just now there are ~25k connections for this one tserver,
>> of which 99.9% are writing to various DataNodes on port 50010.
>> It's split about 50/50 for connections that are CLOSED_WAIT and ones that
>> are ESTABLISHED.  No special RPC configuration.
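>>
>> In case it's useful to others, the breakdown came from something like
>> this (a sketch; $TSERVER_PID is the tserver's process id):
>>
>>     # Tally remote DataNode endpoints (port 50010) by address and state.
>>     lsof -nP -i -a -p "$TSERVER_PID" | grep ':50010' \
>>         | awk '{print $(NF-1), $NF}' | sort | uniq -c | sort -rn | head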
>>
>> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>
>>> +1 to looking at the remote end of the socket and see where they're
>>> going/coming to/from. I've seen a few HDFS JIRA issues filed about sockets
>>> left in CLOSED_WAIT.
>>>
>>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>>
>>> (https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
>>> covers some of the technical details)
>>>
>>> On 1/24/18 6:37 PM, Christopher wrote:
>>>
>>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>>> Accumulo version you're running. I'm assuming you verified that it was the
>>>> TabletServer process holding these TCP sockets open using `netstat -p` and
>>>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>>>> confirm based on the port number that these were Thrift connections or
>>>> could they be ZooKeeper or Hadoop connections? Do you have any special
>>>> non-default Accumulo RPC configuration (SSL or SASL)?
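>>>>
>>>> Roughly the cross-check I mean (a sketch; netstat -p needs root, and
>>>> the jps filter assumes a stock tserver process name):
>>>>
>>>>     sudo netstat -tnp | grep CLOSE_WAIT | awk '{print $7}' | sort | uniq -c
>>>>     jps -ml | grep -i tserver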
>>>>
>>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjshook@gmail.com>
>>>> wrote:
>>>>
>>>>     Hello all,
>>>>
>>>>     Has anyone come across an issue with a TabletServer occupying a
>>>>     large number of ports in a CLOSE_WAIT state?  The 'normal' number of
>>>>     used ports on a 12-node cluster is around 12,000 to 20,000.
>>>>     In one instance, there were over 68k, which was preventing other
>>>>     applications from getting a free port and they would fail to start
>>>>     (which is how we found this in the first place).
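>>>>
>>>>     For anyone hitting the same thing, comparing that count against the
>>>>     ephemeral port range shows how close you are to exhaustion (a sketch):
>>>>
>>>>         sysctl net.ipv4.ip_local_port_range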
>>>>
>>>>     Thank you,
>>>>     --Adam
>>>>
>>>>
>>
