hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run
Date Sat, 12 Dec 2009 21:37:58 GMT
I do. I think I saw it just last week, with a failure case as follows, on a small testbed (aren't
they all? :-/ ) that some of our devs are working with:

- Local RS and datanode are talking

- Something happens to the datanode 
    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel
    org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable
to create new block.
    
- RS won't try talking to other datanodes elsewhere on the cluster
    org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7040605219500907455_6449696 
    org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5367929502764356875_6449620 
    org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7075535856966512941_6449680 
    org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_77095304474221514_6449685 

- RS goes down
    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. 
Forcing server shutdown
    org.apache.hadoop.hbase.DroppedSnapshotException ...

Not a blocker, in that the downed RS, with working sync in 0.21, won't lose data and can be restarted.
But it is a critical issue, because it will be frequently encountered and will cause processes on
the cluster to shut down. Without some kind of "god" monitor or human intervention, eventually
there will be insufficient resources to carry all regions.
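For what it's worth, the sequence above boils down to the client never telling the namenode which datanodes it has already failed against, so it can be handed the same dead node again and again until it gives up and the RS shuts down. Below is a minimal, purely illustrative sketch of that dynamic (this is NOT real DFSClient or namenode code; the datanode names, the chooseTarget helper, and the retry limit are all made up for illustration):

```java
import java.util.*;

// Illustrative simulation of block allocation retries. A "dead" node is
// one that is still registered with the namenode but fails the connect ack.
public class BlockAllocSketch {
    static final int MAX_RETRIES = 5;

    // Hypothetical namenode target chooser: returns the first candidate
    // that the client has not asked to exclude.
    static String chooseTarget(List<String> registered, Set<String> excluded) {
        for (String dn : registered) {
            if (!excluded.contains(dn)) return dn;
        }
        return null;
    }

    // sendExcludes=false models the pre-HDFS-630 client: the namenode never
    // learns which node failed, so it keeps handing back the same dead node
    // ("Abandoning block ..." repeats, then "Unable to create new block.").
    // sendExcludes=true models the HDFS-630 behavior: failed nodes are
    // excluded on retry, so allocation falls through to a healthy node.
    static boolean allocate(List<String> registered, Set<String> dead,
                            boolean sendExcludes) {
        Set<String> excluded = new HashSet<>();
        for (int i = 0; i < MAX_RETRIES; i++) {
            String dn = chooseTarget(registered,
                    sendExcludes ? excluded : Collections.<String>emptySet());
            if (dn == null) return false;      // no candidates left
            if (!dead.contains(dn)) return true;  // connect ack OK, block created
            excluded.add(dn);                  // client-side note: abandon block
        }
        return false;                          // give up; RS forces shutdown
    }

    public static void main(String[] args) {
        List<String> registered = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> dead = new HashSet<>(Collections.singleton("dn1"));
        System.out.println(allocate(registered, dead, false)); // keeps picking dn1, fails
        System.out.println(allocate(registered, dead, true));  // excludes dn1, succeeds
    }
}
```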

   - Andy




________________________________
From: Stack <saint.ack@gmail.com>
To: "hbase-dev@hadoop.apache.org" <hbase-dev@hadoop.apache.org>
Sent: Sat, December 12, 2009 1:01:49 PM
Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration
of daughters; fix the order in which things are run

So we think this is critical to HBase?
Stack



On Dec 12, 2009, at 12:43 PM, Andrew Purtell <apurtell@apache.org> wrote:

> All HBase committers should jump on that issue and +1. We should make that kind of statement
for the record.
> 
> 
> 
> 
> ________________________________
> From: stack (JIRA) <jira@apache.org>
> To: hbase-dev@hadoop.apache.org
> Sent: Sat, December 12, 2009 12:39:18 PM
> Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration
of daughters; fix the order in which things are run
> 
> 
>     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
> 
> stack resolved HBASE-1972.
> --------------------------
> 
>    Resolution: Won't Fix
> 
> Marking as invalid; addressed by HDFS-630. Thanks for looking at this, Cosmin.  Want to
open an issue on getting 630 into 0.21?  There will be pushback I'd imagine, since it's not "critical",
but it might make 0.21.1
> 
>> Failed split results in closed region and non-registration of daughters; fix the
order in which things are run
>> --------------------------------------------------------------------------------------------------------------
>> 
>>                Key: HBASE-1972
>>                URL: https://issues.apache.org/jira/browse/HBASE-1972
>>            Project: Hadoop HBase
>>         Issue Type: Bug
>>           Reporter: stack
>>           Priority: Blocker
>>            Fix For: 0.21.0
>> 
>> 
>> As part of a split, we go to close the region.  The close fails because flush failed
-- a DN was down and HDFS refuses to move past it -- so we jump up out of the close with an
IOE.  But the region has been closed, yet it's still in the .META. as online.
>> Here is where the hole is:
>> 1. CompactSplitThread calls split.
>> 2. This calls HRegion splitRegion.
>> 3. splitRegion calls close(false).
>> 4. Down the end of the close, we get as far as the LOG.info("Closed " + this).....
but a DFSClient running thread throws an exception because it can't allocate block for the
flush made as part of the close (Ain't sure how... we should add more try/catch in here):
>> {code}
>> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store:
Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566,
entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence
id=2350017, compaction requested=false
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store:
closed info
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed TestTable,,1257986664542
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception
in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_1351692500502810095_1391
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception
in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_-3310646336307339512_1391
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception
in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_3070440586900692765_1393
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception
in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_-5656011219762164043_1393
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception
in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_-2359634393837722978_1393
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception
in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_-1626727145091780831_1393
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: java.io.IOException: Unable to create new block.
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could
not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151"
- Aborting...
>> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
Compaction/Split failed for region TestTable,,1257986664542
>> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> {code}
>> Marking this as blocker.
> 
> --This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 



      