hbase-user mailing list archives

From Marc Limotte <mslimo...@gmail.com>
Subject Re: Seeing errors after loading a fair amount of data. KeeperException$NoNodeException, IOException
Date Thu, 07 Jan 2010 08:39:48 GMT
Thanks for the response, Ryan.

I increased ulimit and the xciever count.  My load job still dies, but the
RegionServers stay up and running, and I can use the hbase shell to retrieve
a row, so I guess HBase is still running.

The machines don't seem to be swapping (still 2-3 GB free), and CPU usage is low:
top - 02:44:02 up 16 min,  1 user,  load average: 0.60, 0.47, 0.37
Tasks: 117 total,   1 running, 116 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0%us,  0.3%sy,  0.0%ni, 95.9%id,  1.8%wa,  0.0%hi,  0.0%si,
0.1%st
Mem:   7348132k total,  4506192k used,  2841940k free,    18788k buffers
Swap:        0k total,        0k used,        0k free,  2846528k cached

open files for the user: between 2000 and 2500
ulimit -n: 4096
nofile in /etc/security/limits.conf: 32768
dfs.datanode.max.xcievers in $HADOOP_HOME/conf/hdfs-site.xml: 2048
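
For reference, a sketch of how those are set on my cluster (values as above; the 'hadoop' user name in limits.conf is just an assumption for illustration):

$HADOOP_HOME/conf/hdfs-site.xml:
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>
  </property>

/etc/security/limits.conf:
  # assuming the Hadoop/HBase daemons run as the 'hadoop' user
  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768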

I also should have mentioned that this is running in Amazon EC2. I checked
out the recommendations for EC2 on the wiki, and hence this particular run
is on a 'c1.xlarge' instance, although most of my prior testing has been on
m1.large.


The jobtracker log has a number of these errors (each task, except for the ones
with a very small input file, fails after a series of retries like this):

2010-01-07 02:43:23,719 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201001070230_0001_m_000018_0: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=1620, listsize=3823, region=feed,\x00\xFFn\xF5\x7F\xFF\xFE\xDA\xFDLi\xA4,1262714416618 for region feed,\x00\xFFn\xF5\x7F\xFF\xFE\xDA\xFDLi\xA4,1262714416618, row '\x01\x03Uz\x7F\xFF\xFE\xDB\x15\xEEt\xFD', but failed after 10 attempts.
Exceptions:
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1120)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
        at org.apache.hadoop.hbase.mapred.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:69)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.close(MapTask.java:631)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
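
For context on where this surfaces: the stack trace shows TableOutputFormat's record writer flushing the client-side write buffer (HTable.flushCommits) as the map task closes, and the flush gives up once the client retries are exhausted. Below is a minimal sketch of that client path; the table name matches the logs, but the column family, qualifier, row key, and the retry/pause values (which I believe are the 0.20 defaults) are assumptions for illustration, not taken from my actual job:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedLoadSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Client retry behaviour: I believe "numtries=10" in the log above maps to
    // hbase.client.retries.number, with hbase.client.pause (ms) between tries.
    conf.setInt("hbase.client.retries.number", 10);
    conf.setLong("hbase.client.pause", 2000);

    HTable table = new HTable(conf, "feed");   // same table as in the logs
    table.setAutoFlush(false);                 // buffer puts client-side, as TableOutputFormat does
    table.setWriteBufferSize(2 * 1024 * 1024); // flush roughly every 2 MB of puts

    Put put = new Put(Bytes.toBytes("example-row"));       // hypothetical row key
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),     // hypothetical family/qualifier
            Bytes.toBytes("value"));
    table.put(put);

    // This is the call failing in the trace above: it throws
    // RetriesExhaustedException once all retries against the region server fail.
    table.flushCommits();
  }
}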


Following are the exceptions I see in the regionserver logs (but they seem
to show up at startup):


java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:933)
2010-01-07 02:30:38,431 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 02:30:38,431 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649)
        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 02:30:38,544 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to set watcher on ZNode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:780)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.watchMasterAddress(ZooKeeperWrapper.java:304)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.watchMasterAddress(HRegionServer.java:385)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitializeZooKeeper(HRegionServer.java:315)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitialize(HRegionServer.java:306)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:276)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.doMain(HRegionServer.java:2474)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2542)
2010-01-07 02:30:38,544 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to set watcher on ZooKeeper master address. Retrying.
...

2010-01-07 01:49:34,756 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2010-01-07 01:49:34,765 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.SelectionKeyImpl@2cba5bdb
java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-01-07 01:49:34,766 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input
java.net.SocketException: Transport endpoint is not connected
    at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
    at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
    at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 01:49:34,767 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.net.SocketException: Transport endpoint is not connected
    at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
    at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
    at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 01:49:34,783 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
2010-01-07 01:49:34,813 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
2010-01-07 01:49:35,015 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server slave3.cluster1/10.251.130.176:2181
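
Given the "Connection refused" / ConnectionLoss errors above, one thing I'm double-checking is the ZooKeeper quorum settings in hbase-site.xml. A sketch of what I'm assuming is there (the hostnames other than slave3.cluster1 are guesses on my part; the logs only show a connection attempt to slave3.cluster1:2181):

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>slave1.cluster1,slave2.cluster1,slave3.cluster1</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>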


-marc

2010/1/5 <hbase-user-digest-help@hadoop.apache.org>

>
> ---------- Forwarded message ----------
> From: Ryan Rawson <ryanobjc@gmail.com>
> To: hbase-user@hadoop.apache.org
> Date: Tue, 5 Jan 2010 16:19:49 -0800
> Subject: Re: Seeing errors after loading a fair amount of data.
> KeeperException$NoNodeException, IOException
> Hey,
>
> What we need is the fatal exception in the Regionserver log... just
> looking though I suspect you might be running into HDFS tuning limits.
>  Xciever count and ulimit -n are the key settings you want to verify.
>
> Look in:
>
> http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description
>
> Let us know!
> -ryan
>
> On Tue, Jan 5, 2010 at 4:06 PM, Marc Limotte <mslimotte@gmail.com> wrote:
> > I'm struggling with a problem that seems to manifest only after I load a
> > fair amount of data.  I run a few map/reduce jobs and load about 80M
> > rows successfully. Then I start another process to load another 35M or so
> > rows, and things start breaking:
> >
> > - Most of the RegionServer processes die (4 out of 5) -- log message below.
> > - HMaster does not die, but seems unresponsive at the status web page (port 60030) -- log message below
> > - HQuorumPeer(s) are still running
> >
> >
> > I restart the entire cluster (full reboot) and try again, but the problem
> > occurs again immediately (similar process state and log messages).
> >
> > It seems that if I truncate the table and restart, I get a similar
> > situation.  *I.e. I can load about 80M rows, but then the RegionServers die
> > and my jobs fail.*
> >
> > Small cluster: 5 nodes
> > 2 x 2 cores, 8gig memory, Fedora 8
> > Hbase 0.20.2 / Hadoop 0.20.2
> >
> > ---------
> >
> > *Master log contains errors like:*
> >
> > 2010-01-05 17:33:05,142 INFO org.apache.hadoop.hbase.master.ServerManager: slave2.cluster1,60020,1262730096922 znode expired
> > 2010-01-05 17:33:05,147 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server slave2.cluster1,60020,1262730096922: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> > 2010-01-05 17:33:05,156 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 9 hlog(s) in hdfs://master.cluster1:9000/hbase/.logs/slave2.cluster1,60020,1262730096922
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/feed/248946612/oldlogfile.log could only be replicated to 0 nodes, instead of 1
> >        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1267)
> >        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
> >        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> >        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> >        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> >        at org.apache.hadoop.ipc.Client.call(Client.java:739)
> >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >        at $Proxy0.addBlock(Unknown Source)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >        at $Proxy0.addBlock(Unknown Source)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2906)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2788)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/feed/248946612/oldlogfile.log" - Aborting...
> >
> >
> > *And the region server logs contain this right after start up:*
> >
> > java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/root-region-server
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:332)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddress(ZooKeeperWrapper.java:318)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readRootRegionLocation(ZooKeeperWrapper.java:231)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:442)
> >        at java.lang.Thread.run(Thread.java:619)
> > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/root-region-server
> >        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> >        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:328)
> >        ... 4 more
> >
> > *and*
> >
> > 2010-01-05 18:49:57,398 WARN org.apache.hadoop.hbase.regionserver.Store: Exception processing reconstruction log hdfs://master.cluster1:9000/hbase/feed/1281791924/oldlogfile.log opening comments -- continuing.  Probably lack-of-HADOOP-1700 causing DATA LOSS!
> > java.io.EOFException
> >        at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >        at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
> >        at org.apache.hadoop.hbase.regionserver.Store.doReconstructionLog(Store.java:318)
> >        at org.apache.hadoop.hbase.regionserver.Store.runReconstructionLog(Store.java:267)
> >        at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:225)
> >        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1500)
> >        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:305)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1621)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1588)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1508)
> >        at java.lang.Thread.run(Thread.java:619)
> >
> > This last seems particularly strange, b/c HADOOP-1700 was fixed in Hadoop
> > 0.19.
> >
> > Any help on what these exceptions mean and what I can do about them would be
> > appreciated.
> >
> > -Marc
> >
>
>
>
>
>
