hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Put MR job.. Regionservers crashes..
Date Wed, 04 Aug 2010 18:45:57 GMT
The first part of the log is usually a client that died while it was
doing a request.

The second part of the log is a session expiration. The log fragment
is too small to tell if it was the region server that paused or the ZK
ensemble that was unreachable during that time... are the zk servers
collocated with region servers?

About your side question, TOF can be used as the output of a map and
it will be just like using the HTable, but I'm not sure how's that
going to play with your streaming job.

About "Were the RS's too busy to ping the ZK quorum?", apart from
looking at the log and seeing a pause or not, only you can tell if you
are overcommitting your ressources. Setup ganglia, look at the
metrics, see if the CPUs are too busy, make sure there's 0kb in the
swap (or just disable it altogether).

J-D

On Wed, Aug 4, 2010 at 11:27 AM, Vidhyashankar Venkataraman
<vidhyash@yahoo-inc.com> wrote:
> I am trying to run an update (inserts and modify operations) experiment on a 100 TB  existing
database..
>
> The job is a map-only job that doesn't use TableOutputFormat (since mappers cannot possibly
know the split anyways, so I cant optimize on locality here of course unless I customize it
on my own.. I will be shortly doing that..)..
>  I have a program that generates Puts on existing rows and new inserts within a specified
range.. I use hadoop streaming to run the program as individual tasks on specified ranges..
>  (A side question is if it is better to use TableOutputFormat with reduce jobs: the
cost of sort-and-shuffle operations seem to be the same in both the cases...
>
>  In my case, shuffle will happen anyways.. Sorts will happen in each region server..
The maps are at coarser shard ranges: 6 maps per node as against 170 regions per node.. )
>
> Existing setup:
> 180 nodes, 6 map tasks per node, 100 TB existing db, 10 TB of updates.. Logs disabled..
4 gig regions.. 12 MB write buffer size in each client.. Single column family and single value
of 28 KB for each row: no compression in the expts (will soon be doing expts with compressed
data)..
>
> Results: Around 130 region servers crashed.. All of them had logs of the following form:
The first set of data isnt the root for crashes.. They seem to be put errors because of a
possible socket timeout.. Is that right?
> The second set of logs below show the point where it crashed.. It looks like a zk timeout..
Were the RS's too busy to ping the ZK quorum?
>
>
> Thank you in advance :)
> Vidhya
>
>
> The logs:
>
>
> 2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread:
Compaction requested for region DocData,0000000149473989,1280881832403/571301273 because:
regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread:
Compaction requested for region DocData,0000000148937993,1280881832423/196751683 because:
regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,886 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread:
Compaction requested for region DocData,0000000147751972,1280881832465/1982268997 because:
regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,900 WARN org.apache.hadoop.ipc.HBaseServer:
IPC Server Responder, call put([B@661cc0d6, [Lorg.apache.hadoop.hbase.client.Put;@69b855f9)
from 63.250.207.23:37600: output error
> 2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder,
call put([B@460ca540, [Lorg.apache.hadoop.hbase.client.Put;@683530ad) from 98.136.191.232:59388:
output error
> 2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder,
call put([B@631e978e, [Lorg.apache.hadoop.hbase.client.Put;@6d701e78) from 63.250.207.113:44020:
output error
> 2010-08-04 08:05:42,913 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder,
call put([B@685060b9, [Lorg.apache.hadoop.hbase.client.Put;@61f39aaa) from 98.136.191.122:40316:
output error
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 87
on 60020 caught: java.nio.channels.ClosedChannelException        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)      
  at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
       at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 86
on 60020 caught: java.nio.channels.ClosedChannelException        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)      
  at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
       at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 85
on 60020 caught: java.nio.channels.ClosedChannelException
>        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>
>
>
> THE SECOND PART OF LOGS WHERE CRASHES OCCURRED...
>
>
>
> 2010-08-04 09:30:29,512 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:549)
>        at java.lang.Thread.run(Thread.java:619)
> 2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception closing session
0x32a206703ca0616 to sun.nio.ch.SelectionKeyImpl@3d2292cd
> java.io.IOException: TIMED OUT
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception closing session
0x32a206703ca0014 to sun.nio.ch.SelectionKeyImpl@36f0b7f8
> java.io.IOException: TIMED OUT
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-08-04 09:30:29,685 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got
ZooKeeper event, state: Disconnected, type: None, path: null2010-08-04 09:30:29,695 DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state:
Disconnected, type: None, path:
> null
> 2010-08-04 09:30:29,897 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to
server b3122695.yst.yahoo.net/98.136.191.163:2181
> 2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected
local=/63.250.207.82:40417 remo
> te=b3122695.yst.yahoo.net/98.136.191.163:2181]
> 2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> 2010-08-04 09:30:29,901 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher:
Got ZooKeeper event, state: Expired, type: None, path: null
> 2010-08-04 09:30:29,901 WARN org.apache.zookeeper.ClientCnxn: Exception closing session
0x32a206703ca0616 to sun.nio.ch.SelectionKeyImpl@7935c2a8
> java.io.IOException: Session Expired
>        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
       at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)      
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ZooKeeper: Closing session: 0x32a206703ca0616
> 2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ClientCnxn: Closing ClientCnxn for
session: 0x32a206703ca0616
> 2010-08-04 09:30:29,948 INFO org.apache.zookeeper.ClientCnxn: Disconnecting ClientCnxn
for session: 0x32a206703ca0616
> 2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ZooKeeper: Session: 0x32a206703ca0616
closed
> 2010-08-04 09:30:29,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed
connection with ZooKeeper
> 2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2010-08-04 09:30:30,288 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to
server b3122695.yst.yahoo.net/98.136.191.163:2181
> 2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected
local=/63.250.207.82:40418 remote=b3122695.yst.yahoo.net/98.136.191.163:2181]
> 2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> 2010-08-04 09:30:30,291 WARN org.apache.zookeeper.ClientCnxn: Exception closing session
0x32a206703ca0014 to sun.nio.ch.SelectionKeyImpl@3dc37e55
> java.io.IOException: Session Expired
>        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2010-08-04 09:30:30,291 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got
ZooKeeper event, state: Expired, type: None, path: null
> 2010-08-04 09:30:30,291 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper
session expired
> 2010-08-04 09:30:30,292 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump
of metrics: request=0.0, regions=162, stores=162, storefiles=280, storefileIndexSize=58, memstoreSize=1724,
compactionQueueSize=63, usedHeap=1877, maxHeap=7999, blockCacheSize=20643688, blockCacheFree=2495761816,
blockCacheCount=0, blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
>
>
>

Mime
View raw message