zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andor Molnar <an...@apache.org>
Subject Re: ZooKeeper 3.5 blocker issues
Date Mon, 08 Oct 2018 10:34:27 GMT
Hi,

Let me summarize and give a quick update on the outstanding issues for 3.5 GA:

- ZOOKEEPER-1818 (Fix don't care for trunk)
- ZOOKEEPER-2778 (Potential server deadlock between follower sync with leader and follower
receiving external connection requests.)
- ZOOKEEPER-3021 Migrate project structure to Maven (ongoing)
- ZOOKEEPER-925 Docs generation to Maven
- ZOOKEEPER-3104 (waiting for backport)
- ZOOKEEPER-3125 (waiting for backport PR #647)

The 2 Maven related tickets are no-brainers as well as the backports. ZK-2778 has been picked
up by Maoling (thanks!) as far as I can see, ZK-1818 is the only one waiting for a volunteer.

Please correct me if I’ve missed something.

Regards,
Andor




> On 2018. Sep 28., at 18:32, Tamas Penzes <tamaas@cloudera.com.INVALID> wrote:
> 
> Hi All,
> 
> I would add ZOOKEEPER-3021
> <https://issues.apache.org/jira/browse/ZOOKEEPER-3021> Migrate project
> structure to Maven build as a blocker too. Since the migration has started
> it would be good to finish before releasing ZK 3.5.x GA.
> 
> ZOOKEEPER-925 <https://issues.apache.org/jira/browse/ZOOKEEPER-925> replace
> our forrest site and documentation generation might also be a good idea,
> since then we could deliver the new MarkDown based documentation.
> 
> Regards, Tamaas
> 
> On Fri, Sep 14, 2018 at 10:09 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
> 
>> Oh, sorry for the confusion, I should provide more context.
>> 
>> Leader will use on disk txn sync with followers to if the peer zxid is not
>> in it's in memory commit logs, the code is here: Leader on disk txn sync
>> <
>> https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L774
>>> .
>> There is bug that potentially there will be gap in the txn files, like
>> after snap sync, etc, so it's possible the peer will miss txns due to this.
>> 
>> The option to disable it is snapshotSizeFactor
>> <
>> https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZKDatabase.java#L81
>>> ,
>> set it to -1 will disable this feature. On 3.5, it's better to have a PR to
>> set this to -1 by default. It might have more SNAP sync, but from our prod
>> it doesn't seem to be a big problem to me.
>> 
>> I can send out the diff to disable it by default on 3.5 if you guys think
>> this is the right way to do.
>> 
>> Thanks,
>> Fangmin
>> 
>> On Thu, Sep 13, 2018 at 1:58 AM Andor Molnar <andor@apache.org> wrote:
>> 
>>> What’s needed to turn it off?
>>> Do we need a PR or it’s just a config option?
>>> Shall we implement a feature switch for that and turn it off by default?
>>> 
>>> Sorry I don’t have too much insight on disk txn sync.
>>> 
>>> Andor
>>> 
>>> 
>>> 
>>>> On 2018. Sep 13., at 9:16, Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>> 
>>>> And to be clear, ZOOKEEPER-2418 is actually just one case of
>>> inconsistency
>>>> which could caused by on disk txn sync, as I mentioned in a newer JIRA
>>>> ZOOKEEPER-2846 <https://issues.apache.org/jira/browse/ZOOKEEPER-2846>,
>>> the
>>>> snap sync or txn sync could also leave txns gap in the txn file, which
>>> is a
>>>> more common case could trigger this issue.
>>>> 
>>>> I would suggest to turn off the on disk txn sync by default for now to
>>>> avoid this issue, after we finished ZOOKEEPER-3114, we can use that to
>>>> validate the on disk txns during syncing.
>>>> 
>>>> Thanks,
>>>> Fangmin
>>>> 
>>>> On Wed, Sep 12, 2018 at 9:55 AM Fangmin Lv <lvfangmin@gmail.com>
>> wrote:
>>>> 
>>>>> Andor,
>>>>> 
>>>>> ZOOKEEPER-3114 is about adding real time digest checking to help
>>> detecting
>>>>> inconsistency, it's a new feature with amounts of code change. I'll
>>> start
>>>>> upstream it part by part, but I don't expect it's being merged in the
>>> next
>>>>> few weeks. So yes, it's a nice to have, but definitely not a block for
>>> 3.5.
>>>>> 
>>>>> Thanks,
>>>>> Fangmin
>>>>> 
>>>>> On Wed, Sep 12, 2018 at 2:55 AM Andor Molnar <andor@apache.org>
>> wrote:
>>>>> 
>>>>>> Fangmin,
>>>>>> 
>>>>>> Sorry, I just noticed that you want to include the consistency fixes
>> in
>>>>>> the stable version which is fine. Let’s finish the backports and
>> we’ll
>>> be
>>>>>> done with them.
>>>>>> 
>>>>>> ZOOKEEPER-3114 is essentially a new feature, I wouldn’t block 3.5
>> with
>>>>>> that. What do you think?
>>>>>> 
>>>>>> Andor
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 2018. Sep 12., at 11:52, Andor Molnar <andor@apache.org>
wrote:
>>>>>>> 
>>>>>>> Cool, thanks for the clarification.
>>>>>>> 
>>>>>>> The updated list is as follows:
>>>>>>> 
>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower
sync
>> with
>>>>>> leader and follower receiving external connection requests.)
>>>>>>> 
>>>>>>> The following are not critical and no blockers for the stable
>> release:
>>>>>>> 
>>>>>>> Waiting for to be ported to 3.5:
>>>>>>> - ZOOKEEPER-3104
>>>>>>> - ZOOKEEPER-3125
>>>>>>> - ZOOKEEPER-3127
>>>>>>> 
>>>>>>> New feature:
>>>>>>> - ZOOKEEPER-3114 (fixes ZOOKEEPER-2184 too)
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Andor
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 2018. Sep 12., at 0:42, Fangmin Lv <lvfangmin@gmail.com>
wrote:
>>>>>>>> 
>>>>>>>> Hi Andor,
>>>>>>>> 
>>>>>>>> That's the on disk txn feature, which was disabled internally
after
>>> we
>>>>>>>> found the potentially inconsistent issue. The only solution
we have
>>>>>> for now
>>>>>>>> is waiting for the new digest checking feature I mentioned
in
>>>>>>>> ZOOKEEPER-3114.
>>>>>>>> 
>>>>>>>> I think there are some other critical consistent issues we
just
>> fixed
>>>>>> on
>>>>>>>> master recently: ZOOKEEPER-3104, ZOOKEEPER-3125, ZOOKEEPER-3127,
I
>>>>>> think we
>>>>>>>> should include that in the official 3.5 release as well.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Fangmin
>>>>>>>> 
>>>>>>>> On Tue, Sep 11, 2018 at 11:58 AM Andor Molnár <andor@apache.org>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Jeelani,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks for letting me know. I'm happy to remove it from
the list
>> to
>>>>>> get
>>>>>>>>> closer to a stable release. :)
>>>>>>>>> 
>>>>>>>>> What's the feature which can be disabled to avoid data
>>> inconsistency?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Andor
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 09/10/2018 11:33 PM, Mohamed Jeelani wrote:
>>>>>>>>>> Thanks Andor for compiling this. Should we be ignoring
>>>>>> ZOOKEEPER-2418 as
>>>>>>>>> well? This exists in 3.4 as well and the feature can
be disabled.
>> We
>>>>>> are
>>>>>>>>> working on a longer term fix for it in 3.6.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> 
>>>>>>>>>> Jeelani
>>>>>>>>>> 
>>>>>>>>>> On 9/10/18, 5:19 AM, "Andor Molnar" <andor@cloudera.com.INVALID
>>> 
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Fine.
>>>>>>>>>> 
>>>>>>>>>> I'm happy to ignore 1549, 2846 and 2930. Still we
have the list
>>> of:
>>>>>>>>>> 
>>>>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast
protocol)
>>>>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>>>>>>>> - ZOOKEEPER-2418 (txnlog diff sync can skip sending
some
>>>>>>>>> transactions to
>>>>>>>>>> followers)
>>>>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between
follower
>> sync
>>>>>>>>> with
>>>>>>>>>> leader and follower receiving external connection
requests.)
>>>>>>>>>> 
>>>>>>>>>> SSL (ZK-236) is a feature which essential for the
3.5 release,
>>>>>> hence
>>>>>>>>> I
>>>>>>>>>> wouldn't leave it out or postpone it for the next
stable
>> release.
>>>>>> PR
>>>>>>>>> has
>>>>>>>>>> been out for a long time, get on reviewing please.
>>>>>>>>>> The rest are also long outstanding issues which have
been found
>> in
>>>>>>>>> the 3.5
>>>>>>>>>> branch.
>>>>>>>>>> ZK-1818 is something which was found in 3.4 and fixed
in 3.4,
>> but
>>>>>>>>> never has
>>>>>>>>>> been fixed in 3.5. Quite a serious issue if still
present.
>>>>>>>>>> 
>>>>>>>>>> I think we should at least run some manual testing
and see if we
>>>>>>>>> could
>>>>>>>>>> repro any of these issues before going ahead with
a stable
>>> release.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Andor
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Sep 7, 2018 at 3:24 AM, Michael Han <hanm@apache.org>
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I haven't went through the entire list, but looks
like lots of
>> the
>>>>>>>>> JIRA
>>>>>>>>>>> issues listed in this thread, such as ZOOKEEPER-1549,
2846, also
>>>>>>>>> affects
>>>>>>>>>>> 3.4 releases. Should we scope these issues out?
>>>>>>>>>>> 
>>>>>>>>>>> I think historically the single outstanding blocking
issue for a
>>>>>>>>> stable 3.5
>>>>>>>>>>> release is the reconfig feature and security
concerns around it
>>>>>>>>> (somehow
>>>>>>>>>>> addressed in ZOOKEEPER-2014), and the alpha and
beta releases
>> were
>>>>>>>>> created
>>>>>>>>>>> to stabilize that feature.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>> 
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__zookeeper-2Duser.578899.n2.nabble.com_Zookeeper-2Dwith-2D&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=_tGtL3nMWtuPrXKXDx27AIWOzyyT7W-CjIVLDFZwT0E&e=
>>>>>>>>>>> SSL-release-date-tt7581744.html
>>>>>>>>>>> 
>>>>>>>>>>> So it looks like we are in good shape to release.
Something
>> might
>>>>>>>>> worth
>>>>>>>>>>> doing to claim the quality of 3.5 is on par with
3.4
>>>>>>>>>>> 
>>>>>>>>>>> * Run Jepsen on 3.5 - 3.4 passed the test for
the record
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>> 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__aphyr.com_posts_291-2Djepsen-2Dzookeeper&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=VjORkX5s7hrJyl8mW9Q4cfeSWF4qfTdyRjcuAiBt0y4&e=
>>>>>>>>>>> * Fix all flaky tests on 3.5 - 3.4 has little
or no flaky tests
>> at
>>>>>>>>> all.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Sep 4, 2018 at 1:48 AM, Andor Molnar
>>>>>>>>> <andor@cloudera.com.invalid>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks Maoling! That would be huge help,
I appreciate it.
>>>>>>>>>>>> 
>>>>>>>>>>>> Andor
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>> 
>>> 


Mime
View raw message