hbase-dev mailing list archives

From Lars George <lars.geo...@gmail.com>
Subject Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run
Date Sun, 13 Dec 2009 08:16:27 GMT
Just as a note, I think I had the same issue. This is on my 7+1 cluster
during an MR import job:

2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Flush requested on ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Started memstore flush for region
ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643. Current region
memstore size 64.2m
2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: java.net.SocketTimeoutException: 10000 millis timeout while
waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.99.38:51729 remote=/192.168.99.38:50010]
        at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)

2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:57,410 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,567 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 1 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,583 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 2 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,591 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 3 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,598 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 4 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,608 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 5 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,615 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 6 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Marking primary datanode as bad.
2009-12-08 01:15:58,625 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 1 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,637 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 2 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,654 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 3 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,668 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 4 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,678 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 5 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,685 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
192.168.99.38:50010. Aborting...
2009-12-08 01:15:58,685 FATAL
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region:
ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
        at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:946)
        at
org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:839)
        at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
        at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
Caused by: java.io.IOException: Error Recovery for block
blk_2400329754585253075_931440 failed  because recovery from primary
datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
192.168.99.38:50010. Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2584)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2078)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2241)
2009-12-08 01:15:58,688 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=11.9, regions=37, stores=296, storefiles=697, storefileIndexSize=66,
memstoreSize=1184, usedHeap=3319, maxHeap=4087, blockCacheSize=7033392,
blockCacheFree=850216848, blockCacheCount=0, blockCacheHitRatio=0
2009-12-08 01:15:58,688 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver/192.168.99.38:60020.cacheFlusher exiting
2009-12-08 01:15:58,779 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 5 on 60020, call put([B@7237791f,
[Lorg.apache.hadoop.hbase.client.Put;@17f11cce) from 192.168.99.34:34211:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-12-08 01:15:58,796 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 13 on 60020, call put([B@569a24a9,
[Lorg.apache.hadoop.hbase.client.Put;@21dcffaa) from 192.168.99.36:42492:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 2 on 60020: exiting
2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server Responder
2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 8 on 60020: exiting
...
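
What the log shows is the DFSClient retrying pipeline recovery against a primary datanode six times, dropping that node, retrying six more times against the remaining node, and then aborting the stream; the failed flush then comes back as a DroppedSnapshotException and the region server forces itself down. The retry/abort loop behaves roughly like the sketch below (illustrative only, not the actual Hadoop code; the class name, the helper methods and the addresses are made up, the retry limit is taken from the log):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: mimics the retry/abort behavior visible in the
// DFSClient log above. It is not the real DFSClient implementation.
public class PipelineRecoverySketch {

  // The log shows six attempts per primary datanode before giving up on it.
  static final int MAX_RECOVERY_ATTEMPTS = 6;

  // Stand-in for a block recovery attempt against one datanode; in the
  // failure above every attempt failed, so this always returns false.
  static boolean tryRecoverFrom(String primaryDatanode) {
    return false;
  }

  static void recoverBlock(List<String> pipeline) throws IOException {
    while (true) {
      // Pick a primary from the remaining pipeline (here simply the last node).
      String primary = pipeline.get(pipeline.size() - 1);
      for (int attempt = 1; attempt <= MAX_RECOVERY_ATTEMPTS; attempt++) {
        if (tryRecoverFrom(primary)) {
          return; // recovered, the writer can continue
        }
        System.out.println("Error Recovery failed because recovery from primary datanode "
            + primary + " failed " + attempt + " times. Pipeline was " + pipeline);
      }
      // The primary exhausted its retries: mark it bad and drop it.
      pipeline.remove(primary);
      if (pipeline.isEmpty()) {
        // No datanodes left to recover with: the output stream aborts, the
        // memstore flush fails, and HRegion wraps it in DroppedSnapshotException.
        throw new IOException("Recovery failed, no datanodes left. Aborting...");
      }
    }
  }

  public static void main(String[] args) {
    List<String> pipeline =
        new ArrayList<>(Arrays.asList("192.168.99.38:50010", "192.168.99.37:50010"));
    try {
      recoverBlock(pipeline);
    } catch (IOException e) {
      // This is the point where the region server logs
      // "Replay of hlog required. Forcing server shutdown".
      System.out.println("FATAL: " + e.getMessage());
    }
  }
}

In my case both nodes in the pipeline kept failing recovery, so from the client side the abort and the forced shutdown look unavoidable.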

On Sun, Dec 13, 2009 at 1:03 AM, stack <stack@duboce.net> wrote:

> I wrote hdfs-dev to see how to proceed.  We could try running a vote to get
> it committed to 0.21.
> St.Ack
>
>
> On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <apurtell@apache.org>
> wrote:
>
> > I do. I think I saw it just last week with a failure case as follows on a
> > small testbed (aren't they all? :-/ ) that some of our devs are working
> > with:
> >
> > - Local RS and datanode are talking
> >
> > - Something happens to the datanode
> >    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> > channel to be ready for read. ch : java.nio.channels.SocketChannel
> >     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> > java.io.IOException: Unable to create new block.
> >
> > - RS won't try talking to other datanodes elsewhere on the cluster
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_7040605219500907455_6449696
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_-5367929502764356875_6449620
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_7075535856966512941_6449680
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_77095304474221514_6449685
> >
> > - RS goes down
> >    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> > required.
> > Forcing server shutdown
> >    org.apache.hadoop.hbase.DroppedSnapshotException ...
> >
> > Not a blocker in that the downed RS with working sync in 0.21 won't lose
> > data and can be restarted. But it is a critical issue because it will be
> > frequently encountered and will cause processes on the cluster to shut down.
> > Without some kind of "god" monitor or human intervention eventually there
> > will be insufficient resources to carry all regions.
> >
> >   - Andy
> >
> >
> >
> >
> > ________________________________
> > From: Stack <saint.ack@gmail.com>
> > To: "hbase-dev@hadoop.apache.org" <hbase-dev@hadoop.apache.org>
> > Sent: Sat, December 12, 2009 1:01:49 PM
> > Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
> > region and non-registration of daughters; fix the order in which things are run
> >
> > So we think this is critical to HBase?
> > Stack
> >
> >
> >
> > On Dec 12, 2009, at 12:43 PM, Andrew Purtell <apurtell@apache.org>
> wrote:
> >
> > > All HBase committers should jump on that issue and +1. We should make
> > that kind of statement for the record.
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: stack (JIRA) <jira@apache.org>
> > > To: hbase-dev@hadoop.apache.org
> > > Sent: Sat, December 12, 2009 12:39:18 PM
> > > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
> > > region and non-registration of daughters; fix the order in which things are run
> > >
> > >
> > >     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> > >
> > > stack resolved HBASE-1972.
> > > --------------------------
> > >
> > >    Resolution: Won't Fix
> > >
> > > Marking as invalid; addressed by HDFS-630. Thanks for looking at this,
> > > Cosmin. Want to open an issue on getting 630 into 0.21? There will be
> > > pushback, I'd imagine, since it's not "critical", but it might make 0.21.1.
> > >
> > >> Failed split results in closed region and non-registration of daughters; fix the order in which things are run
> > >> --------------------------------------------------------------------------------------------------------------
> > >>
> > >>                Key: HBASE-1972
> > >>                URL: https://issues.apache.org/jira/browse/HBASE-1972
> > >>            Project: Hadoop HBase
> > >>         Issue Type: Bug
> > >>           Reporter: stack
> > >>           Priority: Blocker
> > >>            Fix For: 0.21.0
> > >>
> > >>
> > >> As part of a split, we go to close the region.  The close fails because
> > >> flush failed -- a DN was down and HDFS refuses to move past it -- so we jump
> > >> up out of the close with an IOE.  But the region has been closed, yet it's
> > >> still in the .META. as online.
> > >> Here is where the hole is:
> > >> 1. CompactSplitThread calls split.
> > >> 2. This calls HRegion splitRegion.
> > >> 3. splitRegion calls close(false).
> > >> 4. Down at the end of the close, we get as far as the LOG.info("Closed " + this)...
> > >> but a DFSClient running thread throws an exception because it can't
> > >> allocate a block for the flush made as part of the close (ain't sure
> > >> how... we should add more try/catch in here):
> > >> {code}
> > >> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
> > >> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
> > >> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
> > >> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
> > >> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
> > >> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
> > >> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
> > >> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> > >> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
> > >> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
> > >> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> > >> {code}
> > >> Marking this as blocker.
> > >
> > > --This message is automatically generated by JIRA.
> > > -
> > > You can reply to this email to add a comment to the issue online.
> > >
> > >
> >
> >
> >
> >
> >
>
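
On the HDFS-630 discussion above: as I understand it, the gist of that change is that when the client cannot set up a block output stream it abandons the block and asks the namenode for a new one while telling it which datanodes to exclude, so the retry is not pointed at the same dead node again. Roughly like this sketch (illustrative only, with made-up class and method names -- not the actual patch):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the idea behind HDFS-630 (hypothetical names, not the
// real patch): a client that cannot reach a datanode asks the namenode to
// exclude that node when allocating the next block.
public class ExcludeDeadDatanodeSketch {

  // Stand-in for the namenode's block allocation: returns a target datanode
  // that is not on the excluded list, or fails once nothing is left.
  static String allocateBlockTarget(List<String> excluded) throws IOException {
    for (String dn : Arrays.asList(
        "192.168.99.38:50010", "192.168.99.37:50010", "192.168.99.36:50010")) {
      if (!excluded.contains(dn)) {
        return dn;
      }
    }
    throw new IOException("Unable to create new block.");
  }

  // Stand-in for createBlockOutputStream(); pretends one node is dead.
  static boolean connect(String datanode) {
    return !datanode.equals("192.168.99.38:50010");
  }

  public static void main(String[] args) throws IOException {
    List<String> excluded = new ArrayList<>();
    while (true) {
      String target = allocateBlockTarget(excluded);
      if (connect(target)) {
        System.out.println("Block output stream created on " + target);
        return;
      }
      // Without the exclusion below, the client would abandon block after block
      // on the same dead node until the whole write aborts -- which is the
      // "Abandoning block ..." / "Unable to create new block" sequence above.
      System.out.println("Abandoning block on " + target + "; excluding it from the next request");
      excluded.add(target);
    }
  }
}

With something along those lines in 0.21, the region server in Andy's scenario should get a different datanode from the namenode instead of abandoning blocks until the write aborts.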
