hbase-user mailing list archives

From "Geoff Hendrey" <ghend...@decarta.com>
Subject RE: summary of issue/status
Date Mon, 12 Sep 2011 01:17:51 GMT
Sorry for the flood of emails. I found a post that seems to describe an
issue very similar to mine, with ClosedChannelException.

I get exactly the same stack trace:


2011-09-11 17:30:27,977 WARN  [IPC Server handler 2 on 60020]
ipc.HBaseServer$Handler(1100): IPC Server handler 2 on 60020 caught:
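For reference, a ClosedChannelException comes from java.nio when a write is attempted on a channel that has already been closed (for example, because the peer hung up before the server wrote its response). A minimal, self-contained sketch of how it arises, independent of HBase:

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Pipe;

public class ClosedChannelDemo {
    public static void main(String[] args) throws Exception {
        Pipe pipe = Pipe.open();
        // Close the writable end, simulating a connection that went away.
        pipe.sink().close();
        try {
            // Writing to a closed channel throws ClosedChannelException.
            pipe.sink().write(ByteBuffer.wrap(new byte[]{1}));
        } catch (ClosedChannelException e) {
            System.out.println("caught ClosedChannelException");
        }
    }
}
```

This is why the server-side handler logs the exception while the client simply sees a request with no response.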








He mentions that his MR job does Puts. Ours does a scan and then a put
in the reducer; slightly different, but the symptom is identical. He
sees the same problem while a major compaction is in progress. As I
mentioned in previous emails, I initiated a major_compact from the
shell several days ago, but I still see a lot of compaction activity in
the regionserver logs, such as the following (and many of the
compactions take several

2011-09-11 18:09:23,563 INFO  [regionserver60020.compactor]
regionserver.Store(728): Started compaction of 3 file(s) in cf=V1  into
4269602/.tmp, seqid=125878278, totalSize=58.1m

2011-09-11 18:09:24,009 INFO  [regionserver60020.cacheFlusher]
regionserver.Store(494): Renaming flushed file at
564759f/.tmp/7031901971078401778 to hdfs://

2011-09-11 18:09:24,016 INFO  [regionserver60020.cacheFlusher]
regionserver.Store(504): Added hdfs://
/V1/4520835400045954408, entries=544, sequenceid=125878282,
memsize=25.0m, filesize=24.9m

2011-09-11 18:09:24,022 INFO  [regionserver60020.compactor]
regionserver.Store(737): Completed compaction of 3 file(s), new
/V1/6694303434913089134, size=14.5m; total size for store is 218.7m

2011-09-11 18:09:24,022 INFO  [regionserver60020.compactor]
regionserver.HRegion(781): completed compaction on region
<REDACTED>:3,1315072186065.99d858c926c1e6c05feb638b64269602. after 0sec


However, he observed OOMs in his regionserver logs. I grepped my logs
and there is no OOM. I also ran "lsof | wc -l" and the response is
14000; we are nowhere near any limit ("ulimit -n" is 100000), so I
ruled that out.
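The descriptor check above can also be done per-process rather than system-wide. A small sketch (Linux-specific, assuming /proc is available; substitute the regionserver's pid for /proc/self in practice):

```java
import java.io.File;

public class FdCheck {
    public static void main(String[] args) {
        // On Linux, /proc/self/fd contains one entry per open descriptor
        // of the current process; replace "self" with the regionserver pid.
        File fdDir = new File("/proc/self/fd");
        String[] fds = fdDir.list();
        int open = (fds == null) ? -1 : fds.length;
        System.out.println("open file descriptors: " + open);
        // The soft limit ("ulimit -n") is not directly exposed to Java;
        // compare the printed count against the shell value by hand.
    }
}
```

Comparing this per-process count against "ulimit -n" for the regionserver's user is more precise than a system-wide "lsof | wc -l".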




From: Geoff Hendrey 
Sent: Sunday, September 11, 2011 5:52 PM
To: 'hbase-user@hadoop.apache.org'
Cc: James Ladd; Rohit Nigam; Tony Wang; Parmod Mehta
Subject: summary of issue/status


OK. Here is the summary of what I know:


A region server, after some amount of scanning, can begin to get
ClosedChannelException when it tries to respond to the client.
Unfortunately, this only affects the response to the client. The region
server apparently continues to report to ZooKeeper that it is alive and
OK. Consequently, the regionserver is never shut down. This causes the
client to keep attempting to access regions on the effectively-dead
server, but each request eventually times out on the client side, since
all the client sees is "I sent a request and never received any
response on the socket." However, the client has no way to inform the
master of the problem.
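One client-side mitigation is to tighten the client's RPC timeout and retry settings so it gives up on an unresponsive server and refetches region locations sooner. A sketch using standard HBase client properties (the values below are illustrative, not recommendations, and defaults vary by version):

```
<!-- hbase-site.xml on the client (illustrative values) -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>60000</value>   <!-- ms before an individual RPC is abandoned -->
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>5</value>       <!-- attempts before the operation fails -->
</property>
<property>
  <name>hbase.client.pause</name>
  <value>1000</value>    <!-- base ms between retries -->
</property>
```

This does not fix the stuck regionserver, but it shortens the window during which the client hangs on it.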


If I manually shut down the region server where the problem exists, the
regions get redistributed to other region servers automatically. The
client then receives the new locations of the regions, on a different
region server, and can begin functioning again. However, the problem
soon reappears on a different region server.





