hbase-dev mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters
Date Fri, 12 Feb 2010 22:08:28 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833192#action_12833192 ]

Jean-Daniel Cryans commented on HBASE-2223:

bq. Or just restart replication after the slave is back online and interleave edits from the
queue with new ones as necessary?

One thing I forgot to add is that the job would be configured to only process edits with
timestamps newer than x.
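
To make the timestamp cutoff concrete, here is a minimal sketch in Python. It is not HBase code: the Edit record, the edits_to_replay helper, and the millisecond timestamps are all hypothetical, and "newer than x" is assumed to mean strictly greater than x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical stand-in for a replicated edit (not an HBase class)."""
    row: str
    timestamp: int  # assumed to be milliseconds since the epoch
    value: str


def edits_to_replay(edits, cutoff_ts):
    """Keep only the edits whose timestamp is strictly newer than cutoff_ts."""
    return [e for e in edits if e.timestamp > cutoff_ts]


edits = [Edit("r1", 1000, "a"), Edit("r2", 2000, "b"), Edit("r3", 3000, "c")]
replayed = edits_to_replay(edits, cutoff_ts=2000)
```

With a cutoff of 2000, only the edit for r3 is kept; the edit stamped exactly at the cutoff is skipped.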

So the problem with resending those edits is something I tackled in HBASE-2197. If one cluster
falls far behind, say by 2 hours, we have to decide where we are going to get that data from.
One option is to use the old log files along with the log files that are currently on the
region servers. That isn't so bad, but what happens in case of failure? In HBASE-2197, the first
solution I described involves using a distributed queue where all region servers (RS) would
participate in processing each log file and interleave it with the rest of the stream.

Another option is keeping yet another set of log files, separate from the "normal" ones, to
which we flush log entries if some cluster falls behind. Then if a region server dies, we
process both sets of log files.
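
As a rough illustration of processing two sets of log files as one stream, the sketch below merges them by timestamp. It assumes each log is already sorted by timestamp and models entries as hypothetical (timestamp, operation) tuples; real log files would need their own reader.

```python
import heapq


def interleave_by_timestamp(*logs):
    """Merge logs, each already sorted by timestamp, into one ordered list."""
    return list(heapq.merge(*logs, key=lambda entry: entry[0]))


# Hypothetical entries: (timestamp, operation)
normal_log = [(100, "put r1"), (400, "put r4")]
backlog_log = [(200, "put r2"), (300, "delete r3")]

merged = interleave_by_timestamp(normal_log, backlog_log)
```

heapq.merge reads its inputs lazily, so the same approach would work on streams too large to hold in memory if the list() call is dropped.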

> Handle 10min+ network partitions between clusters
> -------------------------------------------------
>                 Key: HBASE-2223
>                 URL: https://issues.apache.org/jira/browse/HBASE-2223
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.21.0
> We need a nice way of handling long network partitions without impacting a master cluster
> (which pushes the data). Currently it will just retry over and over again.
> I think we could:
>  - Stop replication to a slave cluster if it didn't respond for more than 10 minutes
>  - Keep track of the duration of the partition
>  - When the slave cluster comes back, initiate a MR job like HBASE-2221
> Maybe we want less than 10 minutes, maybe we want this to be all automatic or just the
> first 2 parts. Discuss.
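
The three steps proposed in the description could be sketched roughly as follows. This is only an illustration under stated assumptions: PartitionTracker and its methods are hypothetical names, time is passed in explicitly as seconds, and launching the catch-up job is left to the caller.

```python
class PartitionTracker:
    """Hypothetical tracker for one slave cluster (not an HBase class)."""

    def __init__(self, timeout_s=600):
        self.timeout_s = timeout_s   # 10 minutes, as proposed in the issue
        self.first_failure = None    # time of the first failed attempt
        self.replicating = True

    def on_failure(self, now):
        """Record a failed attempt; stop replicating once the timeout is exceeded."""
        if self.first_failure is None:
            self.first_failure = now
        if now - self.first_failure > self.timeout_s:
            self.replicating = False

    def on_success(self, now):
        """Resume replication and return the partition duration in seconds
        if replication had been stopped, otherwise None."""
        duration = None
        if not self.replicating:
            duration = now - self.first_failure
        self.first_failure = None
        self.replicating = True
        return duration


tracker = PartitionTracker()
tracker.on_failure(now=0)
tracker.on_failure(now=300)
still_replicating = tracker.replicating   # within the 10-minute window
tracker.on_failure(now=601)
stopped = not tracker.replicating         # timeout exceeded
partition_duration = tracker.on_success(now=900)
```

A non-None return value from on_success is the point at which a caller could start a catch-up job covering the returned duration.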

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
