hbase-issues mailing list archives

From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters
Date Mon, 14 Jun 2010 18:44:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878692#action_12878692 ]

HBase Review Board commented on HBASE-2223:
-------------------------------------------

Message from: "Jean-Daniel Cryans" <jdcryans@apache.org>


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > bin/replication/add_peer.rb, line 21
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1104#file1104line21>
bq.  >
bq.  >     Should you point at some replication documentation here?  Is there such a thing?

There is a package.html later in this patch. Should I point to it?


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > bin/replication/copy_tables_desc.rb, line 58
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1105#file1105line58>
bq.  >
bq.  >     This could get a bit annoying I'd say.

It helped me a lot. Remove it if people complain?


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/HConstants.java, line 342
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1107#file1107line342>
bq.  >
bq.  >     This has to go here?  Can it go into one of the replication classes?

It is used by both the master and the region servers, so to me it belongs there.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/ServerManager.java, line 156
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1109#file1109line156>
bq.  >
bq.  >     Can't you just do c.get("key", defaultvalue)?

No, I also do a check on replication.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java, line 929
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1110#file1110line929>
bq.  >
bq.  >     You writing startcode into zk?  Why not write servername -- the host+port+startcode combo?

To be coherent with the rest of the code that uses zookeeper.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java, line 1075
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1110#file1110line1075>
bq.  >
bq.  >     Is this directory name?  Confusingly named given rootdir+regLogPathStr only adds up to repLogPath.

I don't understand the question, but this code is going to be removed in my next patch since
I'm simplifying RepSink.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 55
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line55>
bq.  >
bq.  >     Peers are named '1', '2'?  Can't we have more meaningful names here?

We agreed that peers are identified internally by a short, since that is how the id is stored.
We could add an external mapping of short->cute_name.
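
As a rough illustration of the short->cute_name idea, here is a minimal sketch. The class and method names are hypothetical and are not part of the patch under review; it only assumes that the internal id stays a short and that the readable name lives in a separate lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch under review.
class PeerNameRegistry {
    private final Map<Short, String> names = new HashMap<>();

    // Associate a human-readable name with the short id that is stored internally.
    void register(short peerId, String name) {
        names.put(peerId, name);
    }

    // Fall back to the numeric id when no name was registered for the peer.
    String displayName(short peerId) {
        return names.getOrDefault(peerId, Short.toString(peerId));
    }
}
```

The stored id never changes in this sketch; only what is shown to the operator does.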


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 59
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line59>
bq.  >
bq.  >     Use servername instead of startcode

Same comment as before, needs to be coherent.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 60
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line60>
bq.  >
bq.  >     All RS's in a master cluster replicate?

Yep... was that an implicit way of saying that I need to document that in RZH?


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 107
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line107>
bq.  >
bq.  >     Should this class be called Wrapper instead of Helper?

Sure


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 185
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line185>
bq.  >
bq.  >     You mean 'ensemble' here rather than 'quorum' (Patrick will kill you if he sees you calling it a 'quorum' when you mean the other)

Argh I'm trying to correct myself but I'm still missing some of them. Thx!


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 263
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line263>
bq.  >
bq.  >     We keep up the replication position in zk?  How much do we replicate in one go?  It's not a single edit, is it?  We do this for every log file?

Yes. A defined amount, specified in ReplicationSource. No. For every current log file; we only
replicate one at a time per region server.
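
To make the answers above concrete, here is a minimal sketch of shipping a bounded batch of edits and remembering how far a log has been read. Everything in it is an assumption for illustration: the class name, the in-memory map standing in for the position kept in zk, and the batch size parameter. It is not the ReplicationSource code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and the batch size are illustrative only.
class BatchedLogShipper {
    // In-memory stand-in for the per-log position that is kept in zk.
    private final Map<String, Integer> positions = new HashMap<>();
    private final int batchSize;

    BatchedLogShipper(int batchSize) {
        this.batchSize = batchSize;
    }

    // Return the next batch of edits from the given log and advance its saved position.
    List<String> shipNextBatch(String logName, List<String> edits) {
        int start = Math.min(positions.getOrDefault(logName, 0), edits.size());
        int end = Math.min(start + batchSize, edits.size());
        List<String> batch = new ArrayList<>(edits.subList(start, end));
        positions.put(logName, end);
        return batch;
    }

    // Last saved position for the log, or 0 if nothing was shipped yet.
    int position(String logName) {
        return positions.getOrDefault(logName, 0);
    }
}
```

Saving the position after each batch, rather than after each edit, is what keeps the number of position updates proportional to the number of batches.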


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 328
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line328>
bq.  >
bq.  >     LOG.warn instead?
bq.  >

I'll do like the rest of the code and use LOG.error.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 354
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line354>
bq.  >
bq.  >     We return empty map if clusters size is == 1?  Should that be clusters.size == 0?

That part isn't clear enough. The reason it's 1 and not 0 is that we put a lock in there, so
the lock is listed among the znodes we fetch. Actually this should be <= 1 rather than ==.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 356
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line356>
bq.  >
bq.  >     What's this about?

See the previous comment: we lock the dead region server's znode by putting a lock in there, but
we don't want to process hlogs under the lock since... it's not a cluster. Could use more doc.
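
A minimal sketch of the check described in the last two answers, assuming the children of the dead region server's znode arrive as a plain list of names. The class name, the method name, and the "lock" child name are hypothetical; the thread does not give the real ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the check discussed above; not the RZH code.
class DeadServerQueues {
    // Assumed name of the lock child znode; the real name is not given in this thread.
    static final String LOCK_ZNODE = "lock";

    // Return the peer cluster znodes whose hlogs should be processed.
    static List<String> peerClustersToProcess(List<String> children) {
        List<String> peers = new ArrayList<>();
        // With the lock in place, a single child means there is nothing to process,
        // hence <= 1 rather than == 0.
        if (children == null || children.size() <= 1) {
            return peers;
        }
        for (String child : children) {
            // The lock is listed with the clusters but is not a cluster itself.
            if (!LOCK_ZNODE.equals(child)) {
                peers.add(child);
            }
        }
        return peers;
    }
}
```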


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 402
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line402>
bq.  >
bq.  >     Just logging errors?  What if session expired (our discussion from last day)?

Yes, I need to review how I handle it in RZH, but I'd also need to review ZKW since some methods
will hide it in there.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 41
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line41>
bq.  >
bq.  >     Call it alpha

yeah! (j/k)


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 64
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line64>
bq.  >
bq.  >     What's this about?  You need to run zk yourself but no zoo.cfg?

I... don't remember why I wrote this.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 73
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line73>
bq.  >
bq.  >     And if not?  What if replicating single-family only?

I forgot to update that after we added scoping; updating now.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 83
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line83>
bq.  >
bq.  >     Has to be offline?  Will this always be the case?

Currently everything is static, but I hope we can move on from that in the future.


bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 108
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line108>
bq.  >
bq.  >     What's ratio?

This is a log snippet that's coming from a region server. Do you want to see more documentation
about it in package.html or in the logging itself?


- Jean-Daniel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/76/#review191
-----------------------------------------------------------





> Handle 10min+ network partitions between clusters
> -------------------------------------------------
>
>                 Key: HBASE-2223
>                 URL: https://issues.apache.org/jira/browse/HBASE-2223
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2223.patch
>
>
> We need a nice way of handling long network partitions without impacting a master cluster (which pushes the data). Currently it will just retry over and over again.
> I think we could:
>  - Stop replication to a slave cluster if it didn't respond for more than 10 minutes
>  - Keep track of the duration of the partition
>  - When the slave cluster comes back, initiate a MR job like HBASE-2221 
> Maybe we want less than 10 minutes, maybe we want this to be all automatic or just the first 2 parts. Discuss.
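
The first two bullets of the proposal can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, timestamps are passed in as milliseconds so the logic is easy to test, and the 10-minute default simply mirrors the value the issue puts up for discussion.

```java
// Hypothetical sketch of the first two bullets; not the patch's implementation.
class PartitionTracker {
    // The 10-minute threshold comes from the issue text and is open for discussion there.
    static final long DEFAULT_TIMEOUT_MS = 10 * 60 * 1000L;

    private final long timeoutMs;
    private long lastResponseMs;

    PartitionTracker(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastResponseMs = nowMs;
    }

    // Call whenever the slave cluster answers a replication request.
    void recordResponse(long nowMs) {
        lastResponseMs = nowMs;
    }

    // How long the slave cluster has been silent, i.e. the partition duration so far.
    long silenceMs(long nowMs) {
        return Math.max(0, nowMs - lastResponseMs);
    }

    // True once the slave cluster has been silent for longer than the timeout.
    boolean shouldStopReplication(long nowMs) {
        return silenceMs(nowMs) > timeoutMs;
    }
}
```

The third bullet, starting an MR job when the slave cluster comes back, is left out of the sketch; the recorded silence duration is the input such a job would need.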

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

