hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: File descriptor leak, possibly new in CDH5.7.0
Date Mon, 23 May 2016 16:59:25 GMT
Have you taken a look at HBASE-9393?

On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:

> Hey everyone,
>
> We are noticing a file descriptor leak that is only affecting nodes in our
> cluster running 5.7.0, not those still running 5.3.8. I ran an lsof against
> an affected regionserver, and noticed that there were 10k+ unix sockets
> that are just called "socket", as well as another 10k+ of the form
> "/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>". The
> two seem related, based on how closely the counts match.
>
> We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0 (we
> handled the namenode upgrade separately). The 5.3.8 nodes *do not*
> experience this issue. The 5.7.0 nodes *do*. We are holding off upgrading
> more regionservers until we can figure this out. I'm not sure if any
> intermediate versions between the two have the issue.
>
> We traced the root cause to a hadoop job running against a basic table:
>
> 'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
> MEMSTORE_FLUSHSIZE => '67108864'}, {NAME => '0', VERSIONS => '50',
> BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
> {'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}
>
> This is very similar to all of our other tables (we have many). However,
> its regions are getting up there in size, 40+ GB per region, compressed.
> This has not been an issue for us previously.
>
> The hadoop job is a simple TableMapper job with no special parameters,
> though we haven't updated our client yet to the latest (will do that once
> we finish the server side). The hadoop job runs on a separate hadoop
> cluster, remotely accessing the HBase cluster. It does not do any other
> reads or writes, outside of the TableMapper scans.
>
> Moving the regions off of an affected server, or killing the hadoop job,
> causes the file descriptors to gradually go back down to normal.
>
> Any ideas?
>
> Thanks,
>
> Bryan
>
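
For anyone wanting to reproduce the counts described above, here is a minimal
sketch that tallies the two descriptor kinds from saved lsof output. It
assumes the default lsof column layout, with the TYPE column showing "unix"
and the NAME column showing "socket" for the unnamed sockets. The function
name, the category labels, and the sample lines (PID, device, and node
values) are made up for illustration and are not taken from the affected
cluster.

```python
import re
from collections import Counter

# Matches the shared-memory segment names quoted in the report:
# /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>
SHM_PATTERN = re.compile(
    r"/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-\d+_1_\d+"
)


def tally_descriptors(lsof_text):
    """Count short-circuit shm segments and unnamed unix sockets in lsof output."""
    counts = Counter()
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip blank lines and the lsof header row.
        if not fields or fields[0] == "COMMAND":
            continue
        if SHM_PATTERN.search(line):
            counts["short_circuit_shm"] += 1
        elif "unix" in fields and fields[-1] == "socket":
            counts["unnamed_unix_socket"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical sample lines, shaped like default lsof output.
SAMPLE = """\
COMMAND   PID  USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
java    12345 hbase  101u  unix 0xffff0001      0t0 1001 socket
java    12345 hbase  102u   REG       0,17     8192 1002 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-1234567_1_89 (deleted)
java    12345 hbase  103u  IPv4      55555      0t0  TCP host:16020 (LISTEN)
"""
```

Running tally_descriptors on lsof output captured a few minutes apart would
show whether the two counts keep rising together and whether they fall again
after the regions are moved or the job is killed.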
