zookeeper-dev mailing list archives

From Fangmin Lv <lvfang...@gmail.com>
Subject Re: ZooKeeper 3.5 blocker issues
Date Fri, 02 Nov 2018 18:46:03 GMT
Hi Andor,

Is anyone working on ZK-2778? I can pick it up if there is no one working
on it yet.

I'll open a 3.5 PR for ZK-3104 today.

Fangmin

On Fri, Oct 26, 2018 at 3:33 AM Andor Molnar <andor@apache.org> wrote:

> Hi folks,
>
> You’ve probably noticed lots of update emails coming from Jira. Please be
> aware that we’ve updated a bunch of open blocker/critical 3.5 tickets to
> reflect what we discussed in this email thread.
>
> If you open up the following jira filter:
>
> project = ZooKeeper and resolution = Unresolved and fixVersion = 3.5.5 AND
> priority in (blocker, critical) ORDER BY priority DESC, key ASC
>
> You’ll see the most up-to-date list of tickets which need to be addressed
> before the stable 3.5 release.
>
> Thank you for your efforts to get this done.
>
> Fangmin, ZK-3104 is waiting for backport, but the ticket has already been
> resolved. Have you created a separate ticket for the backport, or shall I
> just reopen it with the right fix versions?
>
> Thanks,
> Andor
>
>
>
> > On 2018. Oct 8., at 12:34, Andor Molnar <andor@apache.org> wrote:
> >
> > Hi,
> >
> > Let me summarize and give a quick update on the outstanding issues for 3.5 GA:
> >
> > - ZOOKEEPER-1818 (Fix don't care for trunk)
> > - ZOOKEEPER-2778 (Potential server deadlock between follower sync with leader and follower receiving external connection requests.)
> > - ZOOKEEPER-3021 Migrate project structure to Maven (ongoing)
> > - ZOOKEEPER-925 Docs generation to Maven
> > - ZOOKEEPER-3104 (waiting for backport)
> > - ZOOKEEPER-3125 (waiting for backport PR #647)
> >
> > The two Maven-related tickets are no-brainers, as are the backports.
> > ZK-2778 has been picked up by Maoling (thanks!) as far as I can see, so
> > ZK-1818 is the only one waiting for a volunteer.
> >
> > Please correct me if I’ve missed something.
> >
> > Regards,
> > Andor
> >
> >
> >
> >
> >> On 2018. Sep 28., at 18:32, Tamas Penzes <tamaas@cloudera.com.INVALID> wrote:
> >>
> >> Hi All,
> >>
> >> I would add ZOOKEEPER-3021
> >> <https://issues.apache.org/jira/browse/ZOOKEEPER-3021> (Migrate project
> >> structure to Maven build) as a blocker too. Since the migration has started,
> >> it would be good to finish it before releasing ZK 3.5.x GA.
> >>
> >> Including ZOOKEEPER-925 <https://issues.apache.org/jira/browse/ZOOKEEPER-925>
> >> (replace our Forrest site and documentation generation) might also be a good
> >> idea, since then we could deliver the new Markdown-based documentation.
> >>
> >> Regards, Tamaas
> >>
> >> On Fri, Sep 14, 2018 at 10:09 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
> >>
> >>> Oh, sorry for the confusion, I should have provided more context.
> >>>
> >>> The leader will use on-disk txn sync with a follower if the peer zxid is
> >>> not in its in-memory commit log. The code is here: Leader on disk txn sync
> >>> <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L774>.
> >>> There is a bug where gaps can appear in the txn files, for example
> >>> after a snap sync, so it's possible for the peer to miss txns because of this.
> >>>
> >>> The option to disable it is snapshotSizeFactor
> >>> <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZKDatabase.java#L81>;
> >>> setting it to -1 will disable this feature. On 3.5, it's better to have a
> >>> PR to set this to -1 by default. It might cause more SNAP syncs, but from
> >>> what we see in our prod it doesn't seem to be a big problem to me.
> >>>
> >>> I can send out the diff to disable it by default on 3.5 if you guys think
> >>> this is the right way to go.
> >>>
> >>> Thanks,
> >>> Fangmin
> >>>
> >>> On Thu, Sep 13, 2018 at 1:58 AM Andor Molnar <andor@apache.org> wrote:
> >>>
> >>>> What’s needed to turn it off?
> >>>> Do we need a PR, or is it just a config option?
> >>>> Shall we implement a feature switch for that and turn it off by default?
> >>>>
> >>>> Sorry, I don’t have much insight into on-disk txn sync.
> >>>>
> >>>> Andor
> >>>>
> >>>>
> >>>>
> >>>>> On 2018. Sep 13., at 9:16, Fangmin Lv <lvfangmin@gmail.com> wrote:
> >>>>>
> >>>>> And to be clear, ZOOKEEPER-2418 is actually just one case of the
> >>>>> inconsistency that could be caused by on-disk txn sync. As I mentioned in a
> >>>>> newer JIRA, ZOOKEEPER-2846
> >>>>> <https://issues.apache.org/jira/browse/ZOOKEEPER-2846>, the snap sync or
> >>>>> txn sync could also leave txn gaps in the txn file, which is a more common
> >>>>> case that could trigger this issue.
> >>>>>
> >>>>> I would suggest turning off on-disk txn sync by default for now to
> >>>>> avoid this issue. After we finish ZOOKEEPER-3114, we can use that to
> >>>>> validate the on-disk txns during syncing.
> >>>>>
> >>>>> Thanks,
> >>>>> Fangmin
> >>>>>
> >>>>> On Wed, Sep 12, 2018 at 9:55 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
> >>>>>
> >>>>>> Andor,
> >>>>>>
> >>>>>> ZOOKEEPER-3114 is about adding real-time digest checking to help detect
> >>>>>> inconsistency. It's a new feature with a large amount of code change. I'll
> >>>>>> start upstreaming it part by part, but I don't expect it to be merged in the
> >>>>>> next few weeks. So yes, it's a nice-to-have, but definitely not a blocker
> >>>>>> for 3.5.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Fangmin
> >>>>>>
> >>>>>> On Wed, Sep 12, 2018 at 2:55 AM Andor Molnar <andor@apache.org> wrote:
> >>>>>>
> >>>>>>> Fangmin,
> >>>>>>>
> >>>>>>> Sorry, I just noticed that you want to include the consistency fixes in
> >>>>>>> the stable version, which is fine. Let’s finish the backports and we’ll be
> >>>>>>> done with them.
> >>>>>>>
> >>>>>>> ZOOKEEPER-3114 is essentially a new feature, so I wouldn’t block 3.5 on
> >>>>>>> it. What do you think?
> >>>>>>>
> >>>>>>> Andor
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 2018. Sep 12., at 11:52, Andor Molnar <andor@apache.org> wrote:
> >>>>>>>>
> >>>>>>>> Cool, thanks for the clarification.
> >>>>>>>>
> >>>>>>>> The updated list is as follows:
> >>>>>>>>
> >>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
> >>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
> >>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with leader and follower receiving external connection requests.)
> >>>>>>>>
> >>>>>>>> The following are not critical and not blockers for the stable release:
> >>>>>>>>
> >>>>>>>> Waiting to be ported to 3.5:
> >>>>>>>> - ZOOKEEPER-3104
> >>>>>>>> - ZOOKEEPER-3125
> >>>>>>>> - ZOOKEEPER-3127
> >>>>>>>>
> >>>>>>>> New feature:
> >>>>>>>> - ZOOKEEPER-3114 (fixes ZOOKEEPER-2184 too)
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Andor
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On 2018. Sep 12., at 0:42, Fangmin Lv <lvfangmin@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Andor,
> >>>>>>>>>
> >>>>>>>>> That's the on-disk txn sync feature, which we disabled internally after we
> >>>>>>>>> found the potential inconsistency issue. The only solution we have for now
> >>>>>>>>> is to wait for the new digest checking feature I mentioned in
> >>>>>>>>> ZOOKEEPER-3114.
> >>>>>>>>>
> >>>>>>>>> I think there are some other critical consistency issues we just fixed on
> >>>>>>>>> master recently: ZOOKEEPER-3104, ZOOKEEPER-3125, ZOOKEEPER-3127. I think we
> >>>>>>>>> should include those in the official 3.5 release as well.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Fangmin
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 11, 2018 at 11:58 AM Andor Molnár <andor@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Jeelani,
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks for letting me know. I'm happy to remove it from the list to get
> >>>>>>>>>> closer to a stable release. :)
> >>>>>>>>>>
> >>>>>>>>>> What's the feature which can be disabled to avoid data inconsistency?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Andor
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 09/10/2018 11:33 PM, Mohamed Jeelani wrote:
> >>>>>>>>>>> Thanks Andor for compiling this. Should we be ignoring ZOOKEEPER-2418 as
> >>>>>>>>>>> well? This exists in 3.4 as well and the feature can be disabled. We are
> >>>>>>>>>>> working on a longer term fix for it in 3.6.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>>
> >>>>>>>>>>> Jeelani
> >>>>>>>>>>>
> >>>>>>>>>>> On 9/10/18, 5:19 AM, "Andor Molnar" <andor@cloudera.com.INVALID> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Fine.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm happy to ignore 1549, 2846 and 2930. Still, we have the list of:
> >>>>>>>>>>>
> >>>>>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
> >>>>>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
> >>>>>>>>>>> - ZOOKEEPER-2418 (txnlog diff sync can skip sending some transactions to followers)
> >>>>>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with leader and follower receiving external connection requests.)
> >>>>>>>>>>>
> >>>>>>>>>>> SSL (ZK-236) is a feature which is essential for the 3.5 release, hence I
> >>>>>>>>>>> wouldn't leave it out or postpone it to the next stable release. The PR has
> >>>>>>>>>>> been out for a long time, so please get on reviewing it.
> >>>>>>>>>>> The rest are also long-outstanding issues which have been found in the 3.5
> >>>>>>>>>>> branch.
> >>>>>>>>>>> ZK-1818 is something which was found in 3.4 and fixed in 3.4, but has never
> >>>>>>>>>>> been fixed in 3.5. Quite a serious issue if still present.
> >>>>>>>>>>>
> >>>>>>>>>>> I think we should at least run some manual testing and see if we could
> >>>>>>>>>>> repro any of these issues before going ahead with a stable release.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Andor
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Sep 7, 2018 at 3:24 AM, Michael Han <hanm@apache.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I haven't gone through the entire list, but it looks like lots of the JIRA
> >>>>>>>>>>>> issues listed in this thread, such as ZOOKEEPER-1549 and 2846, also affect
> >>>>>>>>>>>> 3.4 releases. Should we scope these issues out?
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think historically the single outstanding blocking issue for a stable 3.5
> >>>>>>>>>>>> release has been the reconfig feature and the security concerns around it
> >>>>>>>>>>>> (somehow addressed in ZOOKEEPER-2014), and the alpha and beta releases were
> >>>>>>>>>>>> created to stabilize that feature.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>>
> >>>>>>>>>>>> http://zookeeper-user.578899.n2.nabble.com/Zookeeper-with-SSL-release-date-tt7581744.html
> >>>>>>>>>>>>
> >>>>>>>>>>>> So it looks like we are in good shape to release. A few things might be
> >>>>>>>>>>>> worth doing to claim that the quality of 3.5 is on par with 3.4:
> >>>>>>>>>>>>
> >>>>>>>>>>>> * Run Jepsen on 3.5 - 3.4 passed the test, for the record
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>>
> >>>>>>>>>>>> https://aphyr.com/posts/291-jepsen-zookeeper
> >>>>>>>>>>>> * Fix all flaky tests on 3.5 - 3.4 has few or no flaky tests at all.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Sep 4, 2018 at 1:48 AM, Andor Molnar <andor@cloudera.com.invalid> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks Maoling! That would be a huge help, I appreciate it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Andor
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>
> >>>>
> >
>
>
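
Note on the snapshotSizeFactor switch discussed in the thread: the idea that setting the factor to -1 disables on-disk txn sync can be sketched roughly as below. This is a simplified, hypothetical model and not the ZooKeeper source. The class and method names are invented for illustration, and the property name "zookeeper.snapshotSizeFactor" and the 0.33 default are assumptions based on the ZKDatabase code linked in the thread, so check them against the branch you run.

```java
// Simplified model of how a negative snapshotSizeFactor could switch off
// on-disk txn log sync. All names here are hypothetical, not ZooKeeper APIs.
public class TxnLogSyncSketch {
    // Assumed property name and default; verify against ZKDatabase in your branch.
    public static final String SNAPSHOT_SIZE_FACTOR = "zookeeper.snapshotSizeFactor";
    public static final double DEFAULT_SNAPSHOT_SIZE_FACTOR = 0.33;

    // Read the factor from a JVM system property, falling back to the default
    // when the property is missing or not a number.
    public static double readFactor() {
        String raw = System.getProperty(SNAPSHOT_SIZE_FACTOR);
        if (raw == null) {
            return DEFAULT_SNAPSHOT_SIZE_FACTOR;
        }
        try {
            return Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            return DEFAULT_SNAPSHOT_SIZE_FACTOR;
        }
    }

    // The txn log size limit is modelled as a fraction of the snapshot size.
    public static long txnLogSizeLimit(long snapshotSizeBytes, double factor) {
        return (long) (snapshotSizeBytes * factor);
    }

    // In this model a negative limit means "do not read proposals from the
    // on-disk txn log", so the leader would fall back to a SNAP sync.
    public static boolean onDiskTxnSyncEnabled(long sizeLimit) {
        return sizeLimit >= 0;
    }

    public static void main(String[] args) {
        // Same effect as starting the JVM with -Dzookeeper.snapshotSizeFactor=-1
        System.setProperty(SNAPSHOT_SIZE_FACTOR, "-1");
        long limit = txnLogSizeLimit(64L * 1024 * 1024, readFactor());
        System.out.println("size limit = " + limit
                + ", on-disk txn sync enabled = " + onDiskTxnSyncEnabled(limit));
    }
}
```

The trade-off mentioned in the thread shows up directly in this model: with a negative factor every peer that has fallen out of the in-memory commit log gets a full snapshot instead of a txn log replay.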
