hbase-user mailing list archives

From "Gibbon, Robert, VF-Group" <Robert.Gib...@vodafone.com>
Subject RE: Seeing errors after loading a fair amount of data. KeeperException$NoNodeException, IOException
Date Thu, 07 Jan 2010 13:04:58 GMT
Maybe you are running Red Hat? Just changing limits.conf I think won't
work, because RH also has a maximum total number of open files across
the whole system, which is 4096 by default, unless you raise that too,
like this:

echo "32768" > /proc/sys/fs/file-max

That takes effect immediately. To make it permanent, edit
/etc/sysctl.conf to include the line:
      fs.file-max = 32768
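
To double-check that the new limit actually took (this assumes a stock
sysctl/proc setup, nothing Red Hat specific), something like this should
do:

sysctl -p                    # reload /etc/sysctl.conf
cat /proc/sys/fs/file-max    # should now print 32768
cat /proc/sys/fs/file-nr     # allocated / free / max file handles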

Maybe you already did that though, or you're using something else, or
maybe this is not your problem.

Kind regards, 
Robert

-----Original Message-----
From: Marc Limotte [mailto:mslimotte@gmail.com] 
Sent: Thursday, 7 January 2010 09:40
To: hbase-user@hadoop.apache.org
Subject: Re: Seeing errors after loading a fair amount of data.
KeeperException$NoNodeException, IOException

Thanks for the response, Ryan.

I increased ulimit and the xcievers limit.  My load job still dies, but the
RegionServers stay up and running, and I can use the hbase shell to
retrieve a row, so I guess HBase is still running.
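
For what it's worth, the shell check is nothing fancy -- roughly along
these lines, where the row key is just a placeholder for one I know
exists in the table:

hbase shell
status
get 'feed', '<some-known-row-key>'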

It doesn't seem to be swapping (still 2-3 GB free) and CPU usage is low:

top - 02:44:02 up 16 min,  1 user,  load average: 0.60, 0.47, 0.37
Tasks: 117 total,   1 running, 116 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0%us,  0.3%sy,  0.0%ni, 95.9%id,  1.8%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   7348132k total,  4506192k used,  2841940k free,    18788k buffers
Swap:        0k total,        0k used,        0k free,  2846528k cached

- open files for the user: between 2000 and 2500
- ulimit -n: 4096
- /etc/security/limits.conf nofile: 32768
- dfs.datanode.max.xcievers in $HADOOP_HOME/conf/hdfs-site.xml: 2048
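
In case it's useful, one rough way to get per-daemon descriptor counts
(assuming pgrep is available; matching the java processes on their main
class names is just a convenience here):

ls /proc/$(pgrep -f DataNode | head -1)/fd | wc -l        # open fds, datanode
ls /proc/$(pgrep -f HRegionServer | head -1)/fd | wc -l   # open fds, regionserver
ulimit -n                                                 # limit for the current shell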

I also should have mentioned that this is running in Amazon EC2. I
checked out the recommendations for EC2 on the wiki, and hence this
particular run is on a 'c1.xlarge' instance, although most of my prior
testing has been on m1.large.


The jobtracker log has a bunch of these errors (each task, except for
ones with a very small input file, fails after a bunch of retries like
this):

2010-01-07 02:43:23,719 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201001070230_0001_m_000018_0: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=1620, listsize=3823, region=feed,\x00\xFFn\xF5\x7F\xFF\xFE\xDA\xFDLi\xA4,1262714416618 for region feed,\x00\xFFn\xF5\x7F\xFF\xFE\xDA\xFDLi\xA4,1262714416618, row '\x01\x03Uz\x7F\xFF\xFE\xDB\x15\xEEt\xFD', but failed after 10 attempts.
Exceptions:
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1120)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
        at org.apache.hadoop.hbase.mapred.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:69)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.close(MapTask.java:631)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
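
Side note: the usual fd / xciever symptoms would show up in the slaves'
datanode and regionserver logs; greps along these lines should surface
them (log paths and exact message wording depend on the install, so
treat this as a sketch):

grep -i "Too many open files" $HADOOP_HOME/logs/*datanode*
grep -iE "xceiver|xciever" $HADOOP_HOME/logs/*datanode*
grep FATAL $HBASE_HOME/logs/*regionserver*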


Following are the exceptions I see in the regionserver logs (but they
seem to show up at startup):


java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:933)
2010-01-07 02:30:38,431 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 02:30:38,431 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649)
        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 02:30:38,544 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to set watcher on ZNode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:780)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.watchMasterAddress(ZooKeeperWrapper.java:304)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.watchMasterAddress(HRegionServer.java:385)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitializeZooKeeper(HRegionServer.java:315)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitialize(HRegionServer.java:306)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:276)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.doMain(HRegionServer.java:2474)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2542)
2010-01-07 02:30:38,544 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to set watcher on ZooKeeper master address. Retrying.

...

2010-01-07 01:49:34,756 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2010-01-07 01:49:34,765 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.SelectionKeyImpl@2cba5bdb
java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-01-07 01:49:34,766 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input
java.net.SocketException: Transport endpoint is not connected
    at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
    at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
    at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 01:49:34,767 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.net.SocketException: Transport endpoint is not connected
    at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
    at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
    at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2010-01-07 01:49:34,783 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
2010-01-07 01:49:34,813 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
2010-01-07 01:49:35,015 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server slave3.cluster1/10.251.130.176:2181
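
The ZooKeeper quorum itself can be sanity-checked with the standard
four-letter commands (assuming nc is installed on the nodes), e.g.:

echo ruok | nc slave3.cluster1 2181   # should answer "imok"
echo stat | nc slave3.cluster1 2181   # connections, mode, latencies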


-marc

2010/1/5 <hbase-user-digest-help@hadoop.apache.org>

>
> ---------- Forwarded message ----------
> From: Ryan Rawson <ryanobjc@gmail.com>
> To: hbase-user@hadoop.apache.org
> Date: Tue, 5 Jan 2010 16:19:49 -0800
> Subject: Re: Seeing errors after loading a fair amount of data.
> KeeperException$NoNodeException, IOException
>
> Hey,
>
> What we need is the fatal exception in the Regionserver log... just
> looking, though, I suspect you might be running into HDFS tuning limits.
> Xciever count and ulimit -n are the key settings you want to verify.
>
> Look in:
>
> http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description
>
> Let us know!
> -ryan
>
> On Tue, Jan 5, 2010 at 4:06 PM, Marc Limotte <mslimotte@gmail.com> wrote:
> > I'm struggling with a problem that seems to manifest only after I
> > load a fair amount of data.  I run a few map/reduce jobs and load
> > about 80M rows successfully. Then I start another process to load
> > another 35M or so rows, and things start breaking:
> >
> > - Most of the RegionServer processes die (4 out of 5) -- log message below.
> > - HMaster does not die, but seems unresponsive at the status web page
> >   (port 60030) -- log message below
> > - HQuorumPeer(s) are still running
> >
> >
> > I restart the entire cluster (full reboot) and try again, but the
> > problem occurs again immediately (similar process state and log messages).
> >
> > It seems that if I truncate the table and restart, I get a similar
> > situation.  *I.e. I can load about 80M rows, but then the
> > RegionServers die and my jobs fail. *
> >
> > Small cluster: 5 nodes
> > 2 x 2 cores, 8gig memory, Fedora 8
> > Hbase 0.20.2 / Hadoop 0.20.2
> >
> > ---------
> >
> > *Master log contains errors like:*
> >
> > 2010-01-05 17:33:05,142 INFO org.apache.hadoop.hbase.master.ServerManager: slave2.cluster1,60020,1262730096922 znode expired
> > 2010-01-05 17:33:05,147 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server slave2.cluster1,60020,1262730096922: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> > 2010-01-05 17:33:05,156 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 9 hlog(s) in hdfs://master.cluster1:9000/hbase/.logs/slave2.cluster1,60020,1262730096922
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/feed/248946612/oldlogfile.log could only be replicated to 0 nodes, instead of 1
> >        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1267)
> >        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
> >        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> >        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> >        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> >        at org.apache.hadoop.ipc.Client.call(Client.java:739)
> >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >        at $Proxy0.addBlock(Unknown Source)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >        at $Proxy0.addBlock(Unknown Source)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2906)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2788)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
> > 2010-01-05 17:33:12,303 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/feed/248946612/oldlogfile.log" - Aborting...
> >
> >
> > *And the region server logs contain this right after start up:*
> >
> > java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/root-region-server
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:332)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddress(ZooKeeperWrapper.java:318)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readRootRegionLocation(ZooKeeperWrapper.java:231)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:442)
> >        at java.lang.Thread.run(Thread.java:619)
> > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/root-region-server
> >        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> >        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:328)
> >        ... 4 more
> >
> > *and*
> >
> > 2010-01-05 18:49:57,398 WARN org.apache.hadoop.hbase.regionserver.Store: Exception processing reconstruction log hdfs://master.cluster1:9000/hbase/feed/1281791924/oldlogfile.log opening comments -- continuing.  Probably lack-of-HADOOP-1700 causing DATA LOSS!
> > java.io.EOFException
> >        at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >        at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
> >        at org.apache.hadoop.hbase.regionserver.Store.doReconstructionLog(Store.java:318)
> >        at org.apache.hadoop.hbase.regionserver.Store.runReconstructionLog(Store.java:267)
> >        at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:225)
> >        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1500)
> >        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:305)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1621)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1588)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1508)
> >        at java.lang.Thread.run(Thread.java:619)
> >
> > This last seems particularly strange, b/c HADOOP-1700 was fixed in 
> > Hadoop 0.19.
> >
> > Any help on what these exceptions mean and what I can do about them
> > would be appreciated.
> >
> > -Marc
> >
>
>
>
>
>
