hbase-user mailing list archives

From Xu-Feng Mao <m9s...@gmail.com>
Subject The number of fds and CLOSE_WAIT sockets keeps increasing.
Date Mon, 22 Aug 2011 10:33:19 GMT
Hi,

We are running the CDH3u0 HBase/Hadoop suite on 28 nodes. Since last Friday, on three of our
regionservers the number of open fds and CLOSE_WAIT connections has kept increasing.
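
For reference, one way to watch a JVM's open fd count, besides ls /proc/<pid>/fd | wc -l, is the
OperatingSystem MXBean of the Sun JVM. A minimal standalone sketch (not our actual monitoring
setup, just a way to confirm the growth) would look like:

====
import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

// Standalone sketch, not HBase code: prints this process's open fd count once a minute.
public class FdWatcher {
    public static void main(String[] args) throws Exception {
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            System.out.println("open fds: " + os.getOpenFileDescriptorCount()
                    + " / max: " + os.getMaxFileDescriptorCount());
            Thread.sleep(60000);
        }
    }
}
====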

It looks like when lines such as

====
2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240. has too many store files; delaying flush up to 90000ms
2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2. has too many store files; delaying flush up to 90000ms
====

become more frequent, the number of open fds and CLOSE_WAIT sockets increases
accordingly.
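
As far as we understand, the 90000ms in these messages is hbase.hstore.blockingWaitTime, which
kicks in once a region has more store files than hbase.hstore.blockingStoreFiles allows; the
defaults, as we understand them for CDH3u0 / HBase 0.90, are roughly:

====
<!-- Defaults as we understand them, shown only to explain where the
     "delaying flush up to 90000ms" message comes from. -->
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>7</value>
</property>
<property>
  <name>hbase.hstore.blockingWaitTime</name>
  <value>90000</value>
</property>
====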

We're not sure whether this is some kind of fd leak hit under an unexpected circumstance or
exceptional code path.

With netstat -antp, we found that there are lots of connections like

====
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp       65      0 10.150.161.64:23241         10.150.161.64:50010         CLOSE_WAIT  27748/java
====

These connections stay in that state. It looks as if, for some connections to HDFS, the
datanode has sent its FIN but the regionserver never drains the receive queue or closes its
end of the socket, so the fds and CLOSE_WAIT sockets are probably being leaked.
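
To illustrate what we mean (a minimal standalone sketch, not HBase/HDFS code): once the remote
side has closed, the local socket sits in CLOSE_WAIT, and its fd stays open, until the local
side finally calls close().

====
import java.net.ServerSocket;
import java.net.Socket;

// Standalone illustration of the CLOSE_WAIT state, not HBase/HDFS code.
public class CloseWaitDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);                       // plays the datanode
        Socket client = new Socket("127.0.0.1", server.getLocalPort());  // plays the regionserver's connection
        Socket accepted = server.accept();

        accepted.close();      // remote side sends FIN
        Thread.sleep(60000);   // local side neither reads the EOF nor closes;
                               // `netstat -antp` now shows the client socket in CLOSE_WAIT
        client.close();        // only here is the fd actually released
    }
}
====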

We also see some log entries like
====
2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.150.161.73:50010, add to deadNodes and continue
java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.150.161.64:55229, remote=/10.150.161.73:50010 for file /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241 for block 2791681537571770744_132142063
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
====

The number of these errors is much smaller than the number of "too many store files" WARNs, so
this might not be the cause of the growing fd count, but is it dangerous to the whole cluster?

Thanks and regards,

Mao Xu-Feng
