zookeeper-dev mailing list archives

From Fangmin Lv <lvfang...@gmail.com>
Subject Re: ZooKeeper 3.5 blocker issues
Date Fri, 02 Nov 2018 20:12:00 GMT
Andor,

Here is the PR to port ZK-3104 from master to 3.4:
https://github.com/apache/zookeeper/pull/685.

Fangmin

On Fri, Nov 2, 2018 at 11:46 AM Fangmin Lv <lvfangmin@gmail.com> wrote:

> Hi Andor,
>
> Is anyone working on ZK-2778? I can pick it up if there is no one working
> on it yet.
>
> I'll open a 3.5 PR for ZK-3104 today.
>
> Fangmin
>
> On Fri, Oct 26, 2018 at 3:33 AM Andor Molnar <andor@apache.org> wrote:
>
>> Hi folks,
>>
>> You’ve probably noticed lots of update emails coming from Jira. Please
>> be aware that we’ve updated a bunch of open blocker/critical 3.5 tickets to
>> reflect what we discussed in this thread.
>>
>> If you open up the following jira filter:
>>
>> project = ZooKeeper and resolution = Unresolved and fixVersion = 3.5.5
>> AND priority in (blocker, critical) ORDER BY priority DESC, key ASC
>>
>> You’ll see the most up-to-date list of tickets which need to be addressed
>> before the stable 3.5 release.
>>
>> Thank you for your efforts to get this done.
>>
>> Fangmin, ZK-3104 is waiting for backport, but ticket has already been
>> resolved. Have you created a separate ticket for the backport or shall I
>> just reopen it with the right fix versions?
>>
>> Thanks,
>> Andor
>>
>>
>>
>> > On 2018. Oct 8., at 12:34, Andor Molnar <andor@apache.org> wrote:
>> >
>> > Hi,
>> >
>> > Let me summarize and give a quick update on the outstanding issues for
>> > 3.5 GA:
>> >
>> > - ZOOKEEPER-1818 (Fix don't care for trunk)
>> > - ZOOKEEPER-2778 (Potential server deadlock between follower sync with
>> >   leader and follower receiving external connection requests.)
>> > - ZOOKEEPER-3021 Migrate project structure to Maven (ongoing)
>> > - ZOOKEEPER-925 Docs generation to Maven
>> > - ZOOKEEPER-3104 (waiting for backport)
>> > - ZOOKEEPER-3125 (waiting for backport PR #647)
>> >
>> > The 2 Maven related tickets are no-brainers as well as the backports.
>> > ZK-2778 has been picked up by Maoling (thanks!) as far as I can see,
>> > ZK-1818 is the only one waiting for a volunteer.
>> >
>> > Please correct me if I’ve missed something.
>> >
>> > Regards,
>> > Andor
>> >
>> >
>> >
>> >
>> >> On 2018. Sep 28., at 18:32, Tamas Penzes <tamaas@cloudera.com.INVALID> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I would add ZOOKEEPER-3021
>> >> <https://issues.apache.org/jira/browse/ZOOKEEPER-3021> Migrate project
>> >> structure to Maven build as a blocker too. Since the migration has started,
>> >> it would be good to finish it before releasing ZK 3.5.x GA.
>> >>
>> >> ZOOKEEPER-925 <https://issues.apache.org/jira/browse/ZOOKEEPER-925>, replacing
>> >> our Forrest site and documentation generation, might also be a good idea,
>> >> since then we could deliver the new Markdown-based documentation.
>> >>
>> >> Regards, Tamaas
>> >>
>> >> On Fri, Sep 14, 2018 at 10:09 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
>> >>
>> >>> Oh, sorry for the confusion, I should provide more context.
>> >>>
>> >>> The leader will use on-disk txn sync with a follower if the peer zxid is
>> >>> not in its in-memory commit log; the code is here: Leader on disk txn sync
>> >>> <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L774>.
>> >>> There is a bug where there can be a gap in the txn files, e.g. after a
>> >>> snap sync, so it's possible the peer will miss txns because of this.
>> >>>
>> >>> The option to disable it is snapshotSizeFactor
>> >>> <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZKDatabase.java#L81>;
>> >>> setting it to -1 will disable this feature. On 3.5, it's better to have a PR
>> >>> to set this to -1 by default. It might cause more SNAP syncs, but from what
>> >>> we see in prod it doesn't seem to be a big problem to me.
>> >>>
>> >>> I can send out the diff to disable it by default on 3.5 if you guys think
>> >>> this is the right way to go.
>> >>>
>> >>> Thanks,
>> >>> Fangmin
>> >>>
>> >>> On Thu, Sep 13, 2018 at 1:58 AM Andor Molnar <andor@apache.org> wrote:
>> >>>
>> >>>> What’s needed to turn it off?
>> >>>> Do we need a PR or is it just a config option?
>> >>>> Shall we implement a feature switch for that and turn it off by default?
>> >>>>
>> >>>> Sorry I don’t have too much insight on disk txn sync.
>> >>>>
>> >>>> Andor
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On 2018. Sep 13., at 9:16, Fangmin Lv <lvfangmin@gmail.com> wrote:
>> >>>>>
>> >>>>> And to be clear, ZOOKEEPER-2418 is actually just one case of the
>> >>>>> inconsistency which could be caused by on-disk txn sync. As I mentioned
>> >>>>> in a newer JIRA, ZOOKEEPER-2846
>> >>>>> <https://issues.apache.org/jira/browse/ZOOKEEPER-2846>, the snap sync or
>> >>>>> txn sync could also leave a txn gap in the txn file, which is a more
>> >>>>> common case that could trigger this issue.
>> >>>>>
>> >>>>> I would suggest turning off the on-disk txn sync by default for now to
>> >>>>> avoid this issue. After we finish ZOOKEEPER-3114, we can use that to
>> >>>>> validate the on-disk txns during syncing.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Fangmin
>> >>>>>
>> >>>>> On Wed, Sep 12, 2018 at 9:55 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
>> >>>>>
>> >>>>>> Andor,
>> >>>>>>
>> >>>>>> ZOOKEEPER-3114 is about adding real-time digest checking to help detect
>> >>>>>> inconsistency; it's a new feature with a large amount of code change.
>> >>>>>> I'll start upstreaming it part by part, but I don't expect it to be
>> >>>>>> merged in the next few weeks. So yes, it's a nice-to-have, but
>> >>>>>> definitely not a blocker for 3.5.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Fangmin
>> >>>>>>
>> >>>>>> On Wed, Sep 12, 2018 at 2:55 AM Andor Molnar <andor@apache.org> wrote:
>> >>>>>>
>> >>>>>>> Fangmin,
>> >>>>>>>
>> >>>>>>> Sorry, I just noticed that you want to include the consistency fixes in
>> >>>>>>> the stable version, which is fine. Let’s finish the backports and we’ll
>> >>>>>>> be done with them.
>> >>>>>>>
>> >>>>>>> ZOOKEEPER-3114 is essentially a new feature, I wouldn’t block 3.5 with
>> >>>>>>> that. What do you think?
>> >>>>>>>
>> >>>>>>> Andor
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> On 2018. Sep 12., at 11:52, Andor Molnar <andor@apache.org> wrote:
>> >>>>>>>>
>> >>>>>>>> Cool, thanks for the clarification.
>> >>>>>>>>
>> >>>>>>>> The updated list is as follows:
>> >>>>>>>>
>> >>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>> >>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>> >>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with
>> >>>>>>>>   leader and follower receiving external connection requests.)
>> >>>>>>>>
>> >>>>>>>> The following are not critical and not blockers for the stable release:
>> >>>>>>>>
>> >>>>>>>> Waiting to be ported to 3.5:
>> >>>>>>>> - ZOOKEEPER-3104
>> >>>>>>>> - ZOOKEEPER-3125
>> >>>>>>>> - ZOOKEEPER-3127
>> >>>>>>>>
>> >>>>>>>> New feature:
>> >>>>>>>> - ZOOKEEPER-3114 (fixes ZOOKEEPER-2184 too)
>> >>>>>>>>
>> >>>>>>>> Regards,
>> >>>>>>>> Andor
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>> On 2018. Sep 12., at 0:42, Fangmin Lv <lvfangmin@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi Andor,
>> >>>>>>>>>
>> >>>>>>>>> That's the on-disk txn feature, which was disabled internally after
>> >>>>>>>>> we found the potential inconsistency issue. The only solution we
>> >>>>>>>>> have for now is waiting for the new digest checking feature I
>> >>>>>>>>> mentioned in ZOOKEEPER-3114.
>> >>>>>>>>>
>> >>>>>>>>> I think there are some other critical consistency issues we just
>> >>>>>>>>> fixed on master recently: ZOOKEEPER-3104, ZOOKEEPER-3125,
>> >>>>>>>>> ZOOKEEPER-3127. I think we should include those in the official 3.5
>> >>>>>>>>> release as well.
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Fangmin
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Sep 11, 2018 at 11:58 AM Andor Molnár <andor@apache.org> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi Jeelani,
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks for letting me know. I'm happy to remove it from the list to
>> >>>>>>>>>> get closer to a stable release. :)
>> >>>>>>>>>>
>> >>>>>>>>>> What's the feature which can be disabled to avoid data inconsistency?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Andor
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On 09/10/2018 11:33 PM, Mohamed Jeelani wrote:
>> >>>>>>>>>>> Thanks Andor for compiling this. Should we be ignoring ZOOKEEPER-2418
>> >>>>>>>>>>> as well? This exists in 3.4 as well and the feature can be disabled.
>> >>>>>>>>>>> We are working on a longer term fix for it in 3.6.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Regards,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Jeelani
>> >>>>>>>>>>>
>> >>>>>>>>>>> On 9/10/18, 5:19 AM, "Andor Molnar" <andor@cloudera.com.INVALID> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Fine.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I'm happy to ignore 1549, 2846 and 2930. Still we have the list of:
>> >>>>>>>>>>>
>> >>>>>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>> >>>>>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>> >>>>>>>>>>> - ZOOKEEPER-2418 (txnlog diff sync can skip sending some transactions
>> >>>>>>>>>>>   to followers)
>> >>>>>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with
>> >>>>>>>>>>>   leader and follower receiving external connection requests.)
>> >>>>>>>>>>>
>> >>>>>>>>>>> SSL (ZK-236) is a feature which is essential for the 3.5 release, hence
>> >>>>>>>>>>> I wouldn't leave it out or postpone it to the next stable release. The
>> >>>>>>>>>>> PR has been out for a long time, get on reviewing please.
>> >>>>>>>>>>> The rest are also long outstanding issues which have been found in the
>> >>>>>>>>>>> 3.5 branch.
>> >>>>>>>>>>> ZK-1818 is something which was found in 3.4 and fixed in 3.4, but has
>> >>>>>>>>>>> never been fixed in 3.5. Quite a serious issue if still present.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I think we should at least run some manual testing and see if we could
>> >>>>>>>>>>> repro any of these issues before going ahead with a stable release.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Regards,
>> >>>>>>>>>>> Andor
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Fri, Sep 7, 2018 at 3:24 AM, Michael Han <hanm@apache.org> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> I haven't gone through the entire list, but it looks like lots of the
>> >>>>>>>>>>>> JIRA issues listed in this thread, such as ZOOKEEPER-1549 and 2846,
>> >>>>>>>>>>>> also affect 3.4 releases. Should we scope these issues out?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I think historically the single outstanding blocking issue for a
>> >>>>>>>>>>>> stable 3.5 release is the reconfig feature and security concerns
>> >>>>>>>>>>>> around it (somehow addressed in ZOOKEEPER-2014), and the alpha and
>> >>>>>>>>>>>> beta releases were created to stabilize that feature.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> http://zookeeper-user.578899.n2.nabble.com/Zookeeper-with-SSL-release-date-tt7581744.html
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> So it looks like we are in good shape to release. Something that might
>> >>>>>>>>>>>> be worth doing to claim the quality of 3.5 is on par with 3.4:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> * Run Jepsen on 3.5 - 3.4 passed the test for the record
>> >>>>>>>>>>>>   https://aphyr.com/posts/291-jepsen-zookeeper
>> >>>>>>>>>>>> * Fix all flaky tests on 3.5 - 3.4 has few or no flaky tests at all.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, Sep 4, 2018 at 1:48 AM, Andor Molnar <andor@cloudera.com.invalid> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks Maoling! That would be a huge help, I appreciate it.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Andor
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>
>> >>>>
>> >
>>
>>
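
Note on the snapshotSizeFactor discussion in the thread: the idea of turning off on-disk txn sync by setting the factor to -1 can be illustrated roughly as below. This is only a minimal sketch, not ZooKeeper's code. The class and method names are hypothetical, and the system property name (zookeeper.snapshotSizeFactor) and the 0.33 default are assumptions that should be checked against the ZKDatabase source of the version in use.

```java
// Hypothetical sketch: how a negative size factor can act as an "off" switch
// for syncing followers from on-disk txn logs. Not ZooKeeper's actual code.
public class TxnSyncLimitSketch {
    // Assumed property name and default; verify against ZKDatabase in your version.
    public static final String SNAPSHOT_SIZE_FACTOR = "zookeeper.snapshotSizeFactor";
    public static final double DEFAULT_SNAPSHOT_SIZE_FACTOR = 0.33;

    // Read the factor from a JVM system property, falling back to the default
    // when the property is missing or not a number.
    public static double readFactor() {
        String raw = System.getProperty(SNAPSHOT_SIZE_FACTOR);
        if (raw == null) {
            return DEFAULT_SNAPSHOT_SIZE_FACTOR;
        }
        try {
            return Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            return DEFAULT_SNAPSHOT_SIZE_FACTOR;
        }
    }

    // Upper bound on how many bytes of txn log the leader would be willing to
    // replay, expressed as a fraction of the latest snapshot size.
    public static long txnLogSizeLimit(long snapshotBytes, double factor) {
        return (long) (snapshotBytes * factor);
    }

    // In this sketch a negative factor means "never sync from on-disk txn logs".
    public static boolean onDiskTxnSyncEnabled(double factor) {
        return factor >= 0;
    }

    public static void main(String[] args) {
        // Equivalent to starting the JVM with -Dzookeeper.snapshotSizeFactor=-1
        System.setProperty(SNAPSHOT_SIZE_FACTOR, "-1");
        double factor = readFactor();
        long limit = txnLogSizeLimit(100L * 1024 * 1024, factor);
        System.out.println("factor=" + factor
                + " limit=" + limit
                + " onDiskTxnSyncEnabled=" + onDiskTxnSyncEnabled(factor));
    }
}
```

With the feature off, a lagging follower whose zxid is not in the leader's in-memory commit log would fall back to a full SNAP sync, which is the trade-off Fangmin mentions.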
