zookeeper-dev mailing list archives

From Andor Molnár <an...@apache.org>
Subject Re: ZooKeeper 3.5 blocker issues
Date Mon, 17 Dec 2018 22:55:45 GMT
Hi team,


I'm proud to announce that, thanks to the joint effort of the community,
the 3.5 blockers list has become empty:

"project = ZooKeeper AND resolution = Unresolved AND fixVersion = 3.5.5
AND priority in (blocker, critical) ORDER BY priority DESC, key ASC"


Well... almost. All the blocker issues are gone, but we still have the
Maven migration to complete before the stable release. If you have some
free cycles, please join us in testing the Maven build on this PR:

https://github.com/apache/zookeeper/pull/708

I hope we can merge it pretty soon.


In terms of the builds, the weather on the 3.5 branch is quite sunny these days:

https://builds.apache.org/view/S-Z/view/ZooKeeper/

The Java 11 build is still having some difficulties, which hopefully I
can address before the holidays:

https://issues.apache.org/jira/browse/ZOOKEEPER-3204


If you happen to know about something which is important from 3.5's
perspective and missing from the above, please don't hesitate to share.


Happy ZooKeeping!

Andor



On 11/2/18 21:12, Fangmin Lv wrote:
> Andor,
>
> Here is the PR to port ZK-3104 from master to 3.4:
> https://github.com/apache/zookeeper/pull/685.
>
> Fangmin
>
> On Fri, Nov 2, 2018 at 11:46 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
>
>> Hi Andor,
>>
>> Is anyone working on ZK-2778? I can pick it up if there is no one working
>> on it yet.
>>
>> I'll open a 3.5 PR for ZK-3104 today.
>>
>> Fangmin
>>
>> On Fri, Oct 26, 2018 at 3:33 AM Andor Molnar <andor@apache.org> wrote:
>>
>>> Hi folks,
>>>
>>> You’ve probably noticed lots of update emails coming from Jira. Please
>>> be aware that we’ve updated a bunch of open blocker/critical 3.5 tickets to
>>> reflect what we discussed in this email.
>>>
>>> If you open up the following jira filter:
>>>
>>> project = ZooKeeper and resolution = Unresolved and fixVersion = 3.5.5
>>> AND priority in (blocker, critical) ORDER BY priority DESC, key ASC
>>>
>>> You’ll see the most up-to-date list of tickets which need to be addressed
>>> before the stable 3.5 release.
>>>
>>> Thank you for your efforts to get this done.
>>>
>>> Fangmin, ZK-3104 is waiting for backport, but the ticket has already been
>>> resolved. Have you created a separate ticket for the backport or shall I
>>> just reopen it with the right fix versions?
>>>
>>> Thanks,
>>> Andor
>>>
>>>
>>>
>>>> On 2018. Oct 8., at 12:34, Andor Molnar <andor@apache.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Let me summarize and give a quick update on the outstanding issues for
>>>> 3.5 GA:
>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with
>>>>   leader and follower receiving external connection requests.)
>>>> - ZOOKEEPER-3021 Migrate project structure to Maven (ongoing)
>>>> - ZOOKEEPER-925 Docs generation to Maven
>>>> - ZOOKEEPER-3104 (waiting for backport)
>>>> - ZOOKEEPER-3125 (waiting for backport PR #647)
>>>>
>>>> The 2 Maven related tickets are no-brainers as well as the backports.
>>>> ZK-2778 has been picked up by Maoling (thanks!) as far as I can see,
>>>> ZK-1818 is the only one waiting for a volunteer.
>>>> Please correct me if I’ve missed something.
>>>>
>>>> Regards,
>>>> Andor
>>>>
>>>>
>>>>
>>>>
>>>>> On 2018. Sep 28., at 18:32, Tamas Penzes <tamaas@cloudera.com.INVALID> wrote:
>>>>> Hi All,
>>>>>
>>>>> I would add ZOOKEEPER-3021
>>>>> <https://issues.apache.org/jira/browse/ZOOKEEPER-3021> Migrate project
>>>>> structure to Maven build as a blocker too. Since the migration has
>>>>> started it would be good to finish before releasing ZK 3.5.x GA.
>>>>>
>>>>> ZOOKEEPER-925 <https://issues.apache.org/jira/browse/ZOOKEEPER-925>
>>>>> replace our forrest site and documentation generation might also be a
>>>>> good idea, since then we could deliver the new MarkDown based
>>>>> documentation.
>>>>>
>>>>> Regards, Tamaas
>>>>>
>>>>> On Fri, Sep 14, 2018 at 10:09 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>>>> Oh, sorry for the confusion, I should have provided more context.
>>>>>>
>>>>>> The leader will use on-disk txn sync with followers if the peer zxid
>>>>>> is not in its in-memory commit logs; the code is here: Leader on disk
>>>>>> txn sync
>>>>>> <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L774>.
>>>>>> There is a bug where gaps can appear in the txn files, for example
>>>>>> after a snap sync, so it's possible the peer will miss txns because
>>>>>> of this.
>>>>>>
>>>>>> The option to disable it is snapshotSizeFactor
>>>>>> <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZKDatabase.java#L81>;
>>>>>> setting it to -1 disables this feature. On 3.5, it's better to have a
>>>>>> PR to set this to -1 by default. It might cause more SNAP syncs, but
>>>>>> from what we see in our prod it doesn't seem to be a big problem to me.
>>>>>>
>>>>>> I can send out the diff to disable it by default on 3.5 if you guys
>>>>>> think this is the right way to go.
>>>>>>
>>>>>> Thanks,
>>>>>> Fangmin
>>>>>>
>>>>>> On Thu, Sep 13, 2018 at 1:58 AM Andor Molnar <andor@apache.org> wrote:
>>>>>>> What’s needed to turn it off?
>>>>>>> Do we need a PR or is it just a config option?
>>>>>>> Shall we implement a feature switch for that and turn it off by default?
>>>>>>> Sorry, I don’t have much insight into on-disk txn sync.
>>>>>>>
>>>>>>> Andor
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 2018. Sep 13., at 9:16, Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>>>>>>
>>>>>>>> And to be clear, ZOOKEEPER-2418 is actually just one case of
>>>>>>>> inconsistency which could be caused by on-disk txn sync. As I
>>>>>>>> mentioned in a newer JIRA, ZOOKEEPER-2846
>>>>>>>> <https://issues.apache.org/jira/browse/ZOOKEEPER-2846>, the snap sync
>>>>>>>> or txn sync could also leave txn gaps in the txn file, which is a
>>>>>>>> more common case that could trigger this issue.
>>>>>>>>
>>>>>>>> I would suggest turning off the on-disk txn sync by default for now
>>>>>>>> to avoid this issue; after we finish ZOOKEEPER-3114, we can use that
>>>>>>>> to validate the on-disk txns during syncing.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Fangmin
>>>>>>>>
>>>>>>>> On Wed, Sep 12, 2018 at 9:55 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>>>>>>> Andor,
>>>>>>>>>
>>>>>>>>> ZOOKEEPER-3114 is about adding real-time digest checking to help
>>>>>>>>> detect inconsistency; it's a new feature with a large amount of code
>>>>>>>>> change. I'll start upstreaming it part by part, but I don't expect it
>>>>>>>>> to be merged in the next few weeks. So yes, it's a nice-to-have, but
>>>>>>>>> definitely not a blocker for 3.5.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Fangmin
>>>>>>>>>
>>>>>>>>> On Wed, Sep 12, 2018 at 2:55 AM Andor Molnar <andor@apache.org> wrote:
>>>>>>>>>> Fangmin,
>>>>>>>>>>
>>>>>>>>>> Sorry, I just noticed that you want to include the consistency fixes
>>>>>>>>>> in the stable version, which is fine. Let’s finish the backports and
>>>>>>>>>> we’ll be done with them.
>>>>>>>>>>
>>>>>>>>>> ZOOKEEPER-3114 is essentially a new feature, I wouldn’t block 3.5
>>>>>>>>>> with that. What do you think?
>>>>>>>>>>
>>>>>>>>>> Andor
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 2018. Sep 12., at 11:52, Andor Molnar <andor@apache.org> wrote:
>>>>>>>>>>> Cool, thanks for the clarification.
>>>>>>>>>>>
>>>>>>>>>>> The updated list is as follows:
>>>>>>>>>>>
>>>>>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>>>>>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>>>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync
>>>>>>>>>>>   with leader and follower receiving external connection requests.)
>>>>>>>>>>>
>>>>>>>>>>> The following are not critical and not blockers for the stable release:
>>>>>>>>>>>
>>>>>>>>>>> Waiting to be ported to 3.5:
>>>>>>>>>>> - ZOOKEEPER-3104
>>>>>>>>>>> - ZOOKEEPER-3125
>>>>>>>>>>> - ZOOKEEPER-3127
>>>>>>>>>>>
>>>>>>>>>>> New feature:
>>>>>>>>>>> - ZOOKEEPER-3114 (fixes ZOOKEEPER-2184 too)
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Andor
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 2018. Sep 12., at 0:42, Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>>>>>>>>>> Hi Andor,
>>>>>>>>>>>>
>>>>>>>>>>>> That's the on-disk txn feature, which was disabled internally after
>>>>>>>>>>>> we found the potential inconsistency issue. The only solution we
>>>>>>>>>>>> have for now is waiting for the new digest checking feature I
>>>>>>>>>>>> mentioned in ZOOKEEPER-3114.
>>>>>>>>>>>>
>>>>>>>>>>>> I think there are some other critical consistency issues we just
>>>>>>>>>>>> fixed on master recently: ZOOKEEPER-3104, ZOOKEEPER-3125,
>>>>>>>>>>>> ZOOKEEPER-3127. I think we should include those in the official 3.5
>>>>>>>>>>>> release as well.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Fangmin
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 11, 2018 at 11:58 AM Andor Molnár <andor@apache.org> wrote:
>>>>>>>>>>>>> Hi Jeelani,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for letting me know. I'm happy to remove it from the list to
>>>>>>>>>>>>> get closer to a stable release. :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's the feature which can be disabled to avoid data inconsistency?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andor
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 09/10/2018 11:33 PM, Mohamed Jeelani wrote:
>>>>>>>>>>>>>> Thanks Andor for compiling this. Should we be ignoring
>>>>>>>>>>>>>> ZOOKEEPER-2418 as well? This exists in 3.4 as well and the
>>>>>>>>>>>>>> feature can be disabled. We are working on a longer term fix
>>>>>>>>>>>>>> for it in 3.6.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jeelani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 9/10/18, 5:19 AM, "Andor Molnar" <andor@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>>> Fine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm happy to ignore 1549, 2846 and 2930. Still we have the list of:
>>>>>>>>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>>>>>>>>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>>>>>>>>>>>> - ZOOKEEPER-2418 (txnlog diff sync can skip sending some
>>>>>>>>>>>>>>   transactions to followers)
>>>>>>>>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync
>>>>>>>>>>>>>>   with leader and follower receiving external connection requests.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SSL (ZK-236) is a feature which is essential for the 3.5 release,
>>>>>>>>>>>>>> hence I wouldn't leave it out or postpone it for the next stable
>>>>>>>>>>>>>> release. The PR has been out for a long time, please get on
>>>>>>>>>>>>>> reviewing it.
>>>>>>>>>>>>>> The rest are also long outstanding issues which have been found
>>>>>>>>>>>>>> in the 3.5 branch.
>>>>>>>>>>>>>> ZK-1818 is something which was found in 3.4 and fixed in 3.4, but
>>>>>>>>>>>>>> has never been fixed in 3.5. Quite a serious issue if still present.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think we should at least run some manual testing and see if we
>>>>>>>>>>>>>> could repro any of these issues before going ahead with a stable
>>>>>>>>>>>>>> release.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Andor
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 7, 2018 at 3:24 AM, Michael Han <hanm@apache.org> wrote:
>>>>>>>>>>>>>>> I haven't gone through the entire list, but it looks like lots
>>>>>>>>>>>>>>> of the JIRA issues listed in this thread, such as ZOOKEEPER-1549
>>>>>>>>>>>>>>> and 2846, also affect 3.4 releases. Should we scope these issues
>>>>>>>>>>>>>>> out?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think historically the single outstanding blocking issue for a
>>>>>>>>>>>>>>> stable 3.5 release is the reconfig feature and security concerns
>>>>>>>>>>>>>>> around it (somehow addressed in ZOOKEEPER-2014), and the alpha
>>>>>>>>>>>>>>> and beta releases were created to stabilize that feature.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__zookeeper-2Duser.578899.n2.nabble.com_Zookeeper-2Dwith-2D&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=_tGtL3nMWtuPrXKXDx27AIWOzyyT7W-CjIVLDFZwT0E&e=
>>>>>>>>>>>>>>> SSL-release-date-tt7581744.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So it looks like we are in good shape to release. Some things
>>>>>>>>>>>>>>> might be worth doing to claim the quality of 3.5 is on par with
>>>>>>>>>>>>>>> 3.4:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * Run Jepsen on 3.5 - 3.4 passed the test, for the record:
>>>>>>>>>>>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__aphyr.com_posts_291-2Djepsen-2Dzookeeper&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=VjORkX5s7hrJyl8mW9Q4cfeSWF4qfTdyRjcuAiBt0y4&e=
>>>>>>>>>>>>>>> * Fix all flaky tests on 3.5 - 3.4 has few or no flaky tests at all.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 1:48 AM, Andor Molnar <andor@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Maoling! That would be a huge help, I appreciate it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andor
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>
