hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.
Date Tue, 09 Jan 2018 15:08:24 GMT
+1 on merging to master branch.

On Tue, Jan 9, 2018 at 6:51 AM, Josh Elser <elserj@apache.org> wrote:

> +1 on merge to Master.
>
> I appreciate the details you shared, Duo, but I think I'm still -0 on a
> branch-2 merge at this point. I'm with Stack: would rather pull it into a
> fast 2.1 release.
>
>
> On 1/8/18 8:22 PM, 张铎(Duo Zhang) wrote:
>
>> Anyway, if no objections on merging this into master, let's do it? So that
>> we can start working on the follow-on features, such as table based
>> replication storage, and synchronous replication, etc.
>>
>> Thanks.
>>
>> 2018-01-09 7:19 GMT+08:00 Stack <stack@duboce.net>:
>>
>> On Mon, Jan 8, 2018 at 2:50 PM, 张铎(Duo Zhang) <palomino219@gmail.com>
>>> wrote:
>>>
>>> This 'new' feature only changes DDL part, not the core part of
>>>>
>>> replication,
>>>
>>>> i.e, how to read wal entries and how to replicate it to the remote
>>>>
>>> cluster,
>>>
>>>> etc. And also there is no pb message/storage layout change, you can
>>>> think
>>>> of this as a big refactoring. Theoretically we even do not need to add
>>>>
>>> new
>>>
>>>> UTs for this feature, i.e, no extra stability works. The only visible
>>>> change to users is that it may require more time on modifying peers in
>>>> shell. So in general I think it is less hurt to include it in the coming
>>>> release?
>>>>
>>>> And why I think it SHOULD be included in our 2.0 release is that, the
>>>> synchronous guarantee is really a good thing for our replication related
>>>> UTs. The correctness of the current Test***Replication usually depends
>>>>
>>> on a
>>>
>>>> flakey condition - we will not do a log rolling between the modification
>>>>
>>> on
>>>
>>>> zk and the actual loading of the modification on RS. And we have also
>>>>
>>> done
>>>
>>>> a hard work to cleanup the lockings in ReplicationSourceManager and add
>>>> a
>>>> fat comment to say why it should be synchronized in this way. In
>>>> general,
>>>> the new code is much easier to read, test and debug, and also reduce the
>>>> possibility of flakeyness, which could help us a lot when we start to
>>>> stabilize our build.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> You see it as a big bug fix Duo?
>>>
>>> Kind of. Just like the AM v1, we can do lots of fix to make it more
>> stable,
>> but we know that we can not fix all the problems since we store state in
>> several places and it is a 'mission impossible' to make all the states
>> stay
>> in sync under every situation... So we introduce AM v2.
>> For the replication peer tracking, it is the same problem. It is hard to
>> do
>> fencing with zk watcher since it is asynchronous, so the UTs are always
>> kind of flakey in theoretical. And we depend on replication heavily at
>> Xiaomi, it is always a pain for us.
>>
>>
>>> I'm late to review. Will have a look after beta-1 goes out. This stuff
>>> looks great from outside, especially distributed procedure framework
>>> which
>>> we need all over the place.
>>>
>>> In general I have no problem w/ this in master and an hbase 2.1 (2.1
>>> could
>>> be soon after 2.0). Its late for big stuff in 2.0.... but let me take a
>>> looksee sir.
>>>
>>> Thanks sir. All the concerns here about whether we should merge this into
>> 2.0 are reasonable, I know. Although I really want this in 2.0 because I
>> believe it will help a lot for stabilizing,  I'm OK with merge it to 2.1
>> only if you guys all think so.
>>
>>
>>> St.Ack
>>>
>>>
>>>
>>>
>>>
>>> 2018-01-09 4:53 GMT+08:00 Apekshit Sharma <appy@cloudera.com>:
>>>>
>>>> Same questions as Josh's.
>>>>> 1) We have RCs for beta1 now, which means only commits that can go in
>>>>>
>>>> are
>>>
>>>> bug fixes only. This change - although important, needed from long time
>>>>>
>>>> and
>>>>
>>>>> well done (testing, summary, etc) - seems rather very large to get into
>>>>>
>>>> 2.0
>>>>
>>>>> now. Needs good justification why it has to be 2.1 instead of 2.0.
>>>>>
>>>>> -- Appy
>>>>>
>>>>>
>>>>> On Mon, Jan 8, 2018 at 8:34 AM, Josh Elser <elserj@apache.org>
wrote:
>>>>>
>>>>> -0 From a general project planning point-of-view (not based on the
>>>>>> technical merit of the code) I am uncomfortable about pulling in
a
>>>>>>
>>>>> brand
>>>>
>>>>> new feature after we've already made one beta RC.
>>>>>>
>>>>>> Duo -- can you expand on why this feature is so important that we
>>>>>>
>>>>> should
>>>>
>>>>> break our release plan? Are there problems that would make including
>>>>>>
>>>>> this
>>>>
>>>>> in a 2.1/3.0 unnecessarily difficult? Any kind of color you can
>>>>>>
>>>>> provide
>>>
>>>> on
>>>>>
>>>>>> "why does this need to go into 2.0?" would be helpful.
>>>>>>
>>>>>>
>>>>>> On 1/6/18 1:54 AM, Duo Zhang wrote:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/HBASE-19397
>>>>>>>
>>>>>>> We aim to move the peer modification framework from zk watcher
to
>>>>>>> procedure
>>>>>>> v2 in this issue and the work is done now.
>>>>>>>
>>>>>>> Copy the release note here:
>>>>>>>
>>>>>>> Introduce 5 procedures to do peer modifications:
>>>>>>>
>>>>>>> AddPeerProcedure
>>>>>>>> RemovePeerProcedure
>>>>>>>> UpdatePeerConfigProcedure
>>>>>>>> EnablePeerProcedure
>>>>>>>> DisablePeerProcedure
>>>>>>>>
>>>>>>>> The procedures are all executed with the following stage:
>>>>>>>> 1. Call pre CP hook, if an exception is thrown then give
up
>>>>>>>> 2. Check whether the operation is valid, if not then give
up
>>>>>>>> 3. Update peer storage. Notice that if we have entered this
stage,
>>>>>>>>
>>>>>>> then
>>>>
>>>>> we
>>>>>>>> can not rollback any more.
>>>>>>>> 4. Schedule sub procedures to refresh the peer config on
every RS.
>>>>>>>> 5. Do post cleanup if any.
>>>>>>>> 6. Call post CP hook. The exception thrown will be ignored
since we
>>>>>>>>
>>>>>>> have
>>>>>
>>>>>> already done the work.
>>>>>>>>
>>>>>>>> The procedure will hold an exclusive lock on the peer id,
so now
>>>>>>>>
>>>>>>> there
>>>>
>>>>> is
>>>>>
>>>>>> no concurrent modifications on a single peer.
>>>>>>>>
>>>>>>>> And now it is guaranteed that once the procedure is done,
the peer
>>>>>>>> modification has already taken effect on all RSes.
>>>>>>>>
>>>>>>>> Abstracte a storage layer for replication peer/queue manangement,
>>>>>>>>
>>>>>>> and
>>>
>>>> refactored the upper layer to remove zk related
>>>>>>>>
>>>>>>> naming/code/comment.
>>>
>>>>
>>>>>>>> Add pre/postExecuteProcedures CP hooks to RegionServerObserver,
and
>>>>>>>>
>>>>>>> add
>>>>
>>>>> permission check for executeProcedures method which requires the
>>>>>>>>
>>>>>>> caller
>>>>
>>>>> to
>>>>>>>> be system user or super user.
>>>>>>>>
>>>>>>>> On rolling upgrade: just do not do any replication peer
>>>>>>>>
>>>>>>> modifications
>>>
>>>> during the rolling upgrading. There is no pb/layout changes on the
>>>>>>>> peer/queue storage on zk.
>>>>>>>>
>>>>>>>>
>>>>>>>> And there are other benefits.
>>>>>>> First, we have introduced a general procedure framework to send
>>>>>>>
>>>>>> tasks
>>>
>>>> to
>>>>
>>>>> RS
>>>>>>> and report the report back to Master. It can be used to implement
>>>>>>>
>>>>>> other
>>>>
>>>>> operations such as ACL change.
>>>>>>> Second, zk is used as a external storage now since we do not
depend
>>>>>>>
>>>>>> on
>>>
>>>> zk
>>>>>
>>>>>> watcher any more, it will be much easier to implement a 'table
>>>>>>>
>>>>>> based'
>>>
>>>> replication peer/queue storage.
>>>>>>>
>>>>>>> Please vote:
>>>>>>> [+1] Agree
>>>>>>> [-1] Disagree
>>>>>>> [0] Neutral
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> -- Appy
>>>>>
>>>>>
>>>>
>>>
>>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message