hbase-dev mailing list archives

From stack <st...@duboce.net>
Subject Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run
Date Sun, 13 Dec 2009 00:03:52 GMT
I wrote hdfs-dev to see how to proceed.  We could try running a vote to get
it committed to 0.21.
St.Ack


On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <apurtell@apache.org> wrote:

> I do. I think I saw it just last week with a failure case as follows on a
> small testbed (aren't they all? :-/ ) that some of our devs are working
> with:
>
> - Local RS and datanode are talking
>
> - Something happens to the datanode
>    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel
>     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
>
> - RS won't try talking to other datanodes elsewhere on the cluster
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_7040605219500907455_6449696
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-5367929502764356875_6449620
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_7075535856966512941_6449680
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_77095304474221514_6449685
>
> - RS goes down
>    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> required.
> Forcing server shutdown
>    org.apache.hadoop.hbase.DroppedSnapshotException ...
>
> Not a blocker in that the downed RS with working sync in 0.21 won't lose
> data and can be restarted. But, a critical issue because it will be
> frequently encountered and will cause processes on the cluster to shut down.
> Without some kind of "god" monitor or human intervention eventually there
> will be insufficient resources to carry all regions.
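The failure mode Andy describes is what HDFS-630 targets: on a createBlockOutputStream failure the client abandons the block, but asks the namenode for a new one without saying which datanode was bad, so it can be handed the same dead node again. A minimal sketch of the retry-with-exclusion idea (not the actual DFSClient code; all names here are hypothetical):

```java
import java.util.*;

// Hypothetical sketch of the HDFS-630 idea: remember datanodes that
// failed the pipeline and exclude them when asking for a new block
// target, instead of abandoning blocks against the same dead node.
public class BlockTargetPicker {

    // Stand-in for the namenode's target selection: first live node
    // not in the excluded set, or null if none remain.
    static String pickTarget(List<String> liveNodes, Set<String> excluded) {
        for (String dn : liveNodes) {
            if (!excluded.contains(dn)) {
                return dn;
            }
        }
        return null;
    }

    // Retry loop: on a simulated connect failure, record the bad node
    // and ask again with the grown exclusion set.
    static String allocateWithRetries(List<String> liveNodes,
                                      Set<String> deadNodes,
                                      int maxRetries) {
        Set<String> excluded = new HashSet<>();
        for (int i = 0; i < maxRetries; i++) {
            String target = pickTarget(liveNodes, excluded);
            if (target == null) {
                return null;          // genuinely out of usable nodes
            }
            if (!deadNodes.contains(target)) {
                return target;        // pipeline established
            }
            excluded.add(target);     // "Abandoning block" -> remember why
        }
        return null;
    }
}
```

Without the exclusion set (the pre-630 behavior), pickTarget can keep returning the same dead node until the client gives up, which is the "Unable to create new block" Andy quotes above.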
>
>   - Andy
>
>
>
>
> ________________________________
> From: Stack <saint.ack@gmail.com>
> To: "hbase-dev@hadoop.apache.org" <hbase-dev@hadoop.apache.org>
> Sent: Sat, December 12, 2009 1:01:49 PM
> Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
> region and non-registration of daughters; fix the order in which things are
> run
>
> So we think this is critical to HBase?
> Stack
>
>
>
> On Dec 12, 2009, at 12:43 PM, Andrew Purtell <apurtell@apache.org> wrote:
>
> > All HBase committers should jump on that issue and +1. We should make
> that kind of statement for the record.
> >
> >
> >
> >
> > ________________________________
> > From: stack (JIRA) <jira@apache.org>
> > To: hbase-dev@hadoop.apache.org
> > Sent: Sat, December 12, 2009 12:39:18 PM
> > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
> region and non-registration of daughters; fix the order in which things are
> run
> >
> >
> >     [
> https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
> >
> > stack resolved HBASE-1972.
> > --------------------------
> >
> >    Resolution: Won't Fix
> >
> > Marking as invalid; addressed by HDFS-630. Thanks for looking at this,
> Cosmin.  Want to open an issue on getting 630 into 0.21?   There will be
> pushback I'd imagine since it's not "critical", but it might make 0.21.1
> >
> >> Failed split results in closed region and non-registration of daughters;
> fix the order in which things are run
> >>
> --------------------------------------------------------------------------------------------------------------
> >>
> >>                Key: HBASE-1972
> >>                URL: https://issues.apache.org/jira/browse/HBASE-1972
> >>            Project: Hadoop HBase
> >>         Issue Type: Bug
> >>           Reporter: stack
> >>           Priority: Blocker
> >>            Fix For: 0.21.0
> >>
> >>
> >> As part of a split, we go to close the region.  The close fails because
> the flush failed -- a DN was down and HDFS refuses to move past it -- so we
> jump up out of the close with an IOE.  But the region has been closed, yet
> it's still listed in .META. as online.
> >> Here is where the hole is:
> >> 1. CompactSplitThread calls split.
> >> 2. This calls HRegion splitRegion.
> >> 3. splitRegion calls close(false).
> >> 4. Down at the end of the close, we get as far as the LOG.info("Closed " +
> this)..... but a running DFSClient thread throws an exception because it
> can't allocate a block for the flush made as part of the close (Ain't sure
> how... we should add more try/catch in here):
> >> {code}
> >> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor]
> DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://
> aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566,
> entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to
> TestTable,,1257986664542
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence
> id=2350017, compaction requested=false
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed
> TestTable,,1257986664542
> >> 2009-11-12 00:47:17,906 [Thread-315] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:17,906 [Thread-315] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_1351692500502810095_1391
> >> 2009-11-12 00:47:23,918 [Thread-315] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:23,918 [Thread-315] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-3310646336307339512_1391
> >> 2009-11-12 00:47:29,982 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:29,982 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_3070440586900692765_1393
> >> 2009-11-12 00:47:35,997 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:35,997 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-5656011219762164043_1393
> >> 2009-11-12 00:47:42,007 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:42,007 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-2359634393837722978_1393
> >> 2009-11-12 00:47:48,017 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:48,017 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-1626727145091780831_1393
> >> 2009-11-12 00:47:54,022 [Thread-318] WARN
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> >> 2009-11-12 00:47:54,022 [Thread-318] WARN
> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file
> "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151"
> - Aborting...
> >> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor]
> ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> Compaction/Split failed for region TestTable,,1257986664542
> >> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> >> {code}
> >> Marking this as blocker.
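The ordering hole in steps 1-4 above can be sketched as follows (a hypothetical simplification, not the actual HRegion code): the flush failure surfaces after the region is marked closed but before the daughters are registered, so the IOE leaves .META. pointing at an "online" region that no server is serving.

```java
import java.io.IOException;

// Hypothetical simplification of the split ordering hole: close() marks
// the region closed, then the flush inside close() fails, so splitRegion()
// never reaches the step that registers the daughters in .META.
public class SplitOrderingHole {
    boolean closed = false;
    boolean metaUpdated = false;
    final boolean flushFails;

    SplitOrderingHole(boolean flushFails) { this.flushFails = flushFails; }

    void close() throws IOException {
        closed = true;                // region stops serving
        if (flushFails) {             // DFSClient can't allocate a block
            throw new IOException("Unable to create new block.");
        }
    }

    void splitRegion() {
        try {
            close();                  // step 3: splitRegion calls close(false)
            metaUpdated = true;       // would register daughters in .META.
        } catch (IOException e) {
            // step 4: we fall out here -- region is closed, but .META.
            // still lists it as online, with no daughters registered.
        }
    }
}
```

When the flush fails, the region ends up closed with metaUpdated still false, which is exactly the closed-but-listed-as-online state the issue describes.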
> >
> > --This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
>
>
>
>
