hbase-issues mailing list archives

From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2312) Possible data loss when RS goes into GC pause while rolling HLog
Date Wed, 28 Jul 2010 21:27:18 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893375#action_12893375 ]

HBase Review Board commented on HBASE-2312:
-------------------------------------------

Message from: "Todd Lipcon" <todd@cloudera.com>


bq.  On 2010-07-28 10:17:14, Todd Lipcon wrote:
bq.  > This looks good, except that it requires the HDFS patches to be in place to even
bq.  > compile. Although I think it's OK to not fix the bug in the case that we're on stock HDFS,
bq.  > I think we still need to be able to run, buggily.
bq.  > 
bq.  > Did we determine that all of the other solutions were flawed/too complicated?
bq.  
bq.  Nicolas wrote:
bq.      HBASE-2312 had some lengthy discussions that ultimately led to this route.  For 0.21
bq.      + 0.22, only HADOOP-6840 is needed.  That is a very minor change consisting of only small
bq.      API additions (so, no regression needed) that Dhruba & I think the lack of risk will make
bq.      it easy to approve.  0.20-append needs a couple more JIRAs, but that should be even less flack.
bq.      As long as we ship with 0.20-append HDFS or newer, we'll be fine.
bq.      
bq.      I understand that this diff is early and we will have to wait until we ship with
bq.      the 0.20-append JAR before application.  Basically, I also wanted to show that the API change
bq.      indeed fixes our issue painlessly.  Any +1s on HADOOP-6840 would be appreciated :)  Your comment
bq.      is that we should also backward-support 0.20.3?

Yeah, I think we decided at one point that we should be able to run against a vanilla Apache
cluster, just that it would be "at your own risk", i.e. that the bug fixes wouldn't necessarily
work. E.g. this is why we do the reflection to check for the syncFs() method and warn in the
case when it's not there, but continue to function.
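
As a rough illustration of the "check by reflection, warn, keep running" pattern described above, here is a minimal, self-contained sketch. It is not the HBase code: PatchedStream and StockStream are hypothetical stand-ins for an HDFS output stream with and without syncFs(), and the warning text is made up.

```java
import java.lang.reflect.Method;

class SyncFsProbe {
    // Hypothetical stand-ins: one stream type has syncFs(), the other does not.
    static class PatchedStream { public void syncFs() {} }
    static class StockStream { public void sync() {} }

    /** Returns the public no-arg syncFs() method if the class has one, else null. */
    static Method findSyncFs(Class<?> streamClass) {
        try {
            return streamClass.getMethod("syncFs");
        } catch (NoSuchMethodException e) {
            // Warn, but let the caller continue without the fix.
            System.err.println("WARN: syncFs() not found on " + streamClass.getName()
                + "; running without it, at your own risk");
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("patched: " + (findSyncFs(PatchedStream.class) != null));
        System.out.println("stock:   " + (findSyncFs(StockStream.class) != null));
    }
}
```

The point of the pattern is that a missing method is detected once at runtime instead of breaking compilation or startup.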

In this patch, it would actually fail to work at all, since the RPC for non-recursive create
would get an error at the NN.


- Todd


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/396/#review513
-----------------------------------------------------------





> Possible data loss when RS goes into GC pause while rolling HLog
> ----------------------------------------------------------------
>
>                 Key: HBASE-2312
>                 URL: https://issues.apache.org/jira/browse/HBASE-2312
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 0.20.3
>            Reporter: Karthik Ranganathan
>            Assignee: Nicolas Spiegelberg
>
> There is a rare corner case in which bad things could happen (i.e. data loss):
> 1) RS #1 is going to roll its HLog - not yet created the new one, old one will get no
> more writes
> 2) RS #1 enters GC Pause of Death
> 3) Master lists HLog files of RS #1 that it has to split as RS #1 is dead, starts splitting
> 4) RS #1 wakes up, creates the new HLog (previous one was rolled) and appends an edit
> - which is lost
> The following seems like a possible solution:
> 1) Master detects RS #1 is dead
> 2) The master renames the /hbase/.logs/<regionserver name> directory to something
> else (say /hbase/.logs/<regionserver name>-dead)
> 3) Add mkdir support (as opposed to mkdirs) to HDFS - so that a file create fails if
> the directory doesn't exist. Dhruba tells me this is very doable.
> 4) RS #1 comes back up and is not able to create the new hlog. It restarts itself.
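
The proposed fencing steps can be sketched on a local filesystem, where java.nio.file.Files.createFile already refuses to create missing parent directories. This is only an analogy under stated assumptions: it does not use the HDFS API, and the class name, method names, and the "rs1" and "hlog.N" names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

class LogDirFencing {
    /** Master side: rename the dead server's log directory to "<name>-dead". */
    static Path fence(Path logDir) throws IOException {
        Path dead = logDir.resolveSibling(logDir.getFileName() + "-dead");
        return Files.move(logDir, dead);
    }

    /**
     * Region server side: create a new log file WITHOUT creating missing parents.
     * Returns false if the log directory no longer exists (the server was fenced).
     */
    static boolean tryRollLog(Path logDir, String name) throws IOException {
        try {
            Files.createFile(logDir.resolve(name));
            return true;
        } catch (NoSuchFileException e) {
            return false; // caller should restart itself instead of writing edits
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hbase-logs");
        Path rsDir = Files.createDirectory(root.resolve("rs1"));
        System.out.println("roll before fence: " + tryRollLog(rsDir, "hlog.1"));
        fence(rsDir); // master decides the server is dead
        System.out.println("roll after fence:  " + tryRollLog(rsDir, "hlog.2"));
    }
}
```

With a recursive create (the mkdirs behaviour), the second roll would silently recreate the directory and the new log would never be split, which is the data-loss case described in the issue.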

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

