hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: "Error recovery for block... failed because recovery from primary datanode failed 6 times"
Date Mon, 14 Feb 2011 18:18:05 GMT
Hey Bradford,

Could we see the full log? I bet there's a bunch of ERROR lines in it.
Look for where the metrics get dumped (grep for "dump"), then grab all
the lines before that, back to where the server is still doing normal
stuff.
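
If plain grep gets unwieldy, the same search can be sketched in a few
lines of Python. This is only an illustration of the "find the dump, then
read backwards" step: the function name, the case-insensitive match on
"dump", and the default of 200 lines of context are all assumptions, not
anything HBase ships or requires.

```python
from collections import deque
from typing import Iterable, List


def context_before_dump(lines: Iterable[str], needle: str = "dump",
                        before: int = 200) -> List[str]:
    """Return up to `before` lines preceding the first line that contains
    `needle` (case-insensitive), followed by the matching line itself.
    Returns an empty list if no line matches."""
    window = deque(maxlen=before)  # rolling buffer of the most recent lines
    for line in lines:
        if needle.lower() in line.lower():
            return list(window) + [line]
        window.append(line)
    return []


# Hypothetical usage against a region server log (the path is an assumption):
#
#     with open("regionserver.log", errors="replace") as fh:
#         print("".join(context_before_dump(fh)), end="")
```

Raise `before` if the window does not reach back far enough to show the
server doing normal work.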

BTW that log is telling me that another region server died before that one.

J-D

On Sun, Feb 13, 2011 at 11:40 PM, Bradford Stephens
<bradfordstephens@gmail.com> wrote:
> We've got dfs.replication = 3 in hdfs-site.xml
>
> doing a grep for "FATAL" and the surrounding 50 lines yields this:
>
> Regionserver log: http://pastebin.com/3cYYNhct
>
> HMaster and DataNode logs seem pretty boring, no errors. Some sections
> of lots of scheduling/deleting blocks...
>
> Restarted the HBase nodes, ran the MR job again (it's just reading CSV
> into a table).
>
> Seems to be running just fine.
>
>
> On Sun, Feb 13, 2011 at 11:08 PM, Jonathan Gray <jgray@fb.com> wrote:
>> The DFS errors are after the server aborts.  What is in the log before the
>> server abort?  Doesn't seem to show any reason here which is unusual.
>>
>> Anything in the master?  Did it time out this RS?  You're running with
>> replication = 1?
>>
>>> -----Original Message-----
>>> From: Bradford Stephens [mailto:bradfordstephens@gmail.com]
>>> Sent: Sunday, February 13, 2011 10:59 PM
>>> To: user@hbase.apache.org
>>> Subject: "Error recovery for block... failed because recovery from primary
>>> datanode failed 6 times"
>>>
>>> Hey guys,
>>>
>>> I'm occasionally getting regionservers going down (running a late RC of .89
>>> that Ryan built). 5x c2.xlarge nodes (8gb/6 cores?) on EC2 with EBS drives.
>>>
>>> Here's the error message from the RS log. Hadoop fsck shows it's fine.
>>>
>>> Any ideas?
>>>
>>>
>>> 2011-02-14 01:51:51,715 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed mobile4-2011021,20110122:37b16319-58e8-4809-bca6-83d7598a41dd:E84F9612-CE1A-4FE1-AAE9-2A7AF8C9B2F1:21519,1297657239532.d15ce98030138cad79e248e0845b70ee.
>>> 2011-02-14 01:51:51,715 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: ip-10-243-106-63.ec2.internal,60020,1297656774012
>>> 2011-02-14 01:51:51,711 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver60020.majorCompactionChecker exiting
>>> 2011-02-14 01:51:51,856 INFO org.apache.zookeeper.ZooKeeper: Session: 0x12e225ef5640002 closed
>>> 2011-02-14 01:51:51,856 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <ip-10-204-213-153.ec2.internal:/hbase,ip-10-243-106-63.ec2.internal,60020,1297656773719>Closed connection with ZooKeeper; /hbase/root-region-server
>>> 2011-02-14 01:51:58,706 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
>>> 2011-02-14 01:51:58,706 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
>>> 2011-02-14 01:52:00,031 INFO org.apache.hadoop.hbase.Leases: regionserver60020.leaseChecker closing leases
>>> 2011-02-14 01:52:00,031 INFO org.apache.hadoop.hbase.Leases: regionserver60020.leaseChecker closed leases
>>> 2011-02-14 01:52:00,033 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
>>> 2011-02-14 01:52:00,033 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
>>> 2011-02-14 01:52:00,036 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase-entest/.logs/ip-10-243-106-63.ec2.internal,60020,1297656774012/10.243.106.63%3A60020.1297660376363 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_208685344091455182_10263 failed  because recovery from primary datanode 10.243.106.63:50010 failed 6 times.  Pipeline was 10.243.106.63:50010. Aborting...
>>> java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_208685344091455182_10263 failed  because recovery from primary datanode 10.243.106.63:50010 failed 6 times.  Pipeline was 10.243.106.63:50010. Aborting...
>>>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
>>>       at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>>>       at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>>>       at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:123)
>>>       at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:906)
>>>       at org.apache.hadoop.hbase.regionserver.wal.HLog.completeCacheFlush(HLog.java:1078)
>>>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:943)
>>>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:834)
>>>       at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:786)
>>>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:250)
>>>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:224)
>>>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:146)
>>> 2011-02-14 01:52:00,076 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>>> 2011-02-14 01:52:00,139 WARN org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No longer connected to ZooKeeper, current state: Disconnected
>>>
>>>
>>> --
>>> Bradford Stephens,
>>> Founder, Drawn to Scale
>>> drawntoscalehq.com
>>> 727.697.7528
>>>
>>> http://www.drawntoscalehq.com --  The intuitive, cloud-scale data solution.
>>> Process, store, query, search, and serve all your data.
>>>
>>> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and
>>> Computer Science
>>
>
>
>
