zookeeper-dev mailing list archives

From Andor Molnar <an...@apache.org>
Subject Re: ZooKeeper 3.5 blocker issues
Date Fri, 26 Oct 2018 10:33:08 GMT
Hi folks,

You’ve probably noticed a lot of update emails coming from Jira. Please be aware that we’ve
updated a bunch of open blocker/critical 3.5 tickets to reflect what we discussed in this
thread.

If you open the following Jira filter:

project = ZooKeeper and resolution = Unresolved and fixVersion = 3.5.5 AND priority in (blocker, critical) ORDER BY priority DESC, key ASC

You’ll see the most up-to-date list of tickets which need to be addressed before the stable
3.5 release.
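
For anyone who would rather script this than click through the UI, here is a minimal Python sketch that URL-encodes the filter above for Jira's REST search endpoint. The endpoint path on issues.apache.org, the helper name build_search_url and the chosen fields are assumptions for illustration, not something the project provides.

```python
from urllib.parse import urlencode

# The JQL filter from this message, joined onto one line.
JQL = (
    "project = ZooKeeper and resolution = Unresolved and fixVersion = 3.5.5 "
    "AND priority in (blocker, critical) ORDER BY priority DESC, key ASC"
)

# Assumption: the ASF Jira instance exposes the standard REST search endpoint here.
SEARCH_URL = "https://issues.apache.org/jira/rest/api/2/search"


def build_search_url(jql, fields=("key", "summary", "priority"), max_results=50):
    """Return a GET URL that asks Jira to run the given JQL query (hypothetical helper)."""
    query = urlencode({
        "jql": jql,                    # urlencode percent-encodes spaces, '=' and parentheses
        "fields": ",".join(fields),    # limit the response to the fields we care about
        "maxResults": max_results,
    })
    return f"{SEARCH_URL}?{query}"


if __name__ == "__main__":
    # Print the URL; fetch it with any HTTP client to get the matching tickets as JSON.
    print(build_search_url(JQL))
```

The sketch only builds the URL, so it can be checked without network access.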

Thank you for your efforts to get this done.

Fangmin, ZK-3104 is waiting for backport, but the ticket has already been resolved. Have you created
a separate ticket for the backport, or shall I just reopen it with the right fix versions?

Thanks,
Andor



> On 2018. Oct 8., at 12:34, Andor Molnar <andor@apache.org> wrote:
> 
> Hi,
> 
> Let me summarize and give a quick update on the outstanding issues for 3.5 GA:
> 
> - ZOOKEEPER-1818 (Fix don't care for trunk)
> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with leader and follower
> receiving external connection requests.)
> - ZOOKEEPER-3021 Migrate project structure to Maven (ongoing)
> - ZOOKEEPER-925 Docs generation to Maven
> - ZOOKEEPER-3104 (waiting for backport)
> - ZOOKEEPER-3125 (waiting for backport PR #647)
> 
> The 2 Maven related tickets are no-brainers, as are the backports. As far as I can see,
> ZK-2778 has been picked up by Maoling (thanks!), so ZK-1818 is the only one waiting for a
> volunteer.
> 
> Please correct me if I’ve missed something.
> 
> Regards,
> Andor
> 
> 
> 
> 
>> On 2018. Sep 28., at 18:32, Tamas Penzes <tamaas@cloudera.com.INVALID> wrote:
>> 
>> Hi All,
>> 
>> I would add ZOOKEEPER-3021
>> <https://issues.apache.org/jira/browse/ZOOKEEPER-3021> Migrate project
>> structure to Maven build as a blocker too. Since the migration has started
>> it would be good to finish before releasing ZK 3.5.x GA.
>> 
>> Resolving ZOOKEEPER-925 <https://issues.apache.org/jira/browse/ZOOKEEPER-925>
>> (replacing our Forrest site and documentation generation) might also be a good
>> idea, since then we could deliver the new Markdown-based documentation.
>> 
>> Regards, Tamaas
>> 
>> On Fri, Sep 14, 2018 at 10:09 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
>> 
>>> Oh, sorry for the confusion, I should provide more context.
>>> 
>>> The leader will use on-disk txn sync with followers if the peer zxid is not
>>> in its in-memory commit logs. The code is here: Leader on disk txn sync
>>> <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L774>.
>>> There is a bug: gaps can potentially appear in the txn files (after a snap
>>> sync, for example), so it's possible the peer will miss txns because of this.
>>> 
>>> The option to disable it is snapshotSizeFactor
>>> <https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZKDatabase.java#L81>;
>>> setting it to -1 disables this feature. On 3.5, it would be better to have a PR
>>> that sets this to -1 by default. That might lead to more SNAP syncs, but from what
>>> we see in prod it doesn't seem to be a big problem to me.
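
A rough Python model of the behaviour described above may help: the leader serves a lagging peer from its in-memory commit log when it can, falls back to the on-disk txn log otherwise, and sends a full snapshot once a negative snapshotSizeFactor has switched the on-disk path off. The function and parameter names are hypothetical, and the sketch deliberately ignores the size limit, TRUNC and every other case handled in LearnerHandler; it is not the ZooKeeper code.

```python
def choose_sync_mode(peer_zxid, min_committed_log, max_committed_log,
                     snapshot_size_factor):
    """Simplified model of how a leader brings a lagging follower up to date."""
    if peer_zxid > max_committed_log:
        # A peer ahead of the leader is handled differently; not modelled here.
        raise ValueError("peer is ahead of the leader; case not modelled")

    # Peer is still covered by the in-memory commit log: send only the missing txns.
    if min_committed_log <= peer_zxid:
        return "DIFF (in-memory commit log)"

    # Peer fell behind the in-memory window: the on-disk txn log is the next
    # option, unless it has been disabled with a negative factor.
    if snapshot_size_factor >= 0:
        return "DIFF (on-disk txn log)"

    # On-disk txn sync is disabled: ship a full snapshot instead.
    return "SNAP"


if __name__ == "__main__":
    # Same lagging peer, with the on-disk path enabled and then disabled.
    print(choose_sync_mode(50, 100, 200, 0.33))
    print(choose_sync_mode(50, 100, 200, -1))
```

With the factor at -1, the only change is that a peer outside the in-memory window gets a snapshot, which matches the "more SNAP sync" trade-off mentioned above.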
>>> 
>>> I can send out the diff to disable it by default on 3.5 if you guys think
>>> this is the right way to go.
>>> 
>>> Thanks,
>>> Fangmin
>>> 
>>> On Thu, Sep 13, 2018 at 1:58 AM Andor Molnar <andor@apache.org> wrote:
>>> 
>>>> What’s needed to turn it off?
>>>> Do we need a PR, or is it just a config option?
>>>> Shall we implement a feature switch for that and turn it off by default?
>>>> 
>>>> Sorry, I don’t have much insight into on-disk txn sync.
>>>> 
>>>> Andor
>>>> 
>>>> 
>>>> 
>>>>> On 2018. Sep 13., at 9:16, Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>>> 
>>>>> And to be clear, ZOOKEEPER-2418 is actually just one case of inconsistency
>>>>> which could be caused by on-disk txn sync. As I mentioned in a newer JIRA,
>>>>> ZOOKEEPER-2846 <https://issues.apache.org/jira/browse/ZOOKEEPER-2846>, the
>>>>> snap sync or txn sync could also leave txn gaps in the txn file, which is a
>>>>> more common case that could trigger this issue.
>>>>> 
>>>>> I would suggest turning off the on-disk txn sync by default for now to
>>>>> avoid this issue; after we finish ZOOKEEPER-3114, we can use that to
>>>>> validate the on-disk txns during syncing.
>>>>> 
>>>>> Thanks,
>>>>> Fangmin
>>>>> 
>>>>> On Wed, Sep 12, 2018 at 9:55 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>>> 
>>>>>> Andor,
>>>>>> 
>>>>>> ZOOKEEPER-3114 is about adding real-time digest checking to help detect
>>>>>> inconsistency. It's a new feature with a large amount of code change. I'll start
>>>>>> upstreaming it part by part, but I don't expect it to be merged in the next
>>>>>> few weeks. So yes, it's a nice-to-have, but definitely not a blocker for 3.5.
>>>>>> 
>>>>>> Thanks,
>>>>>> Fangmin
>>>>>> 
>>>>>> On Wed, Sep 12, 2018 at 2:55 AM Andor Molnar <andor@apache.org> wrote:
>>>>>> 
>>>>>>> Fangmin,
>>>>>>> 
>>>>>>> Sorry, I just noticed that you want to include the consistency fixes in
>>>>>>> the stable version, which is fine. Let’s finish the backports and we’ll be
>>>>>>> done with them.
>>>>>>> 
>>>>>>> ZOOKEEPER-3114 is essentially a new feature; I wouldn’t block 3.5 with
>>>>>>> that. What do you think?
>>>>>>> 
>>>>>>> Andor
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 2018. Sep 12., at 11:52, Andor Molnar <andor@apache.org> wrote:
>>>>>>>> 
>>>>>>>> Cool, thanks for the clarification.
>>>>>>>> 
>>>>>>>> The updated list is as follows:
>>>>>>>> 
>>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with
>>>>>>>> leader and follower receiving external connection requests.)
>>>>>>>> 
>>>>>>>> The following are not critical and are not blockers for the stable release:
>>>>>>>> 
>>>>>>>> Waiting to be ported to 3.5:
>>>>>>>> - ZOOKEEPER-3104
>>>>>>>> - ZOOKEEPER-3125
>>>>>>>> - ZOOKEEPER-3127
>>>>>>>> 
>>>>>>>> New feature:
>>>>>>>> - ZOOKEEPER-3114 (fixes ZOOKEEPER-2184 too)
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Andor
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 2018. Sep 12., at 0:42, Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Andor,
>>>>>>>>> 
>>>>>>>>> That's the on-disk txn feature, which was disabled internally after we
>>>>>>>>> found the potential inconsistency issue. The only solution we have for now
>>>>>>>>> is waiting for the new digest checking feature I mentioned in
>>>>>>>>> ZOOKEEPER-3114.
>>>>>>>>> 
>>>>>>>>> I think there are some other critical consistency issues we just fixed on
>>>>>>>>> master recently: ZOOKEEPER-3104, ZOOKEEPER-3125, ZOOKEEPER-3127. I think we
>>>>>>>>> should include those in the official 3.5 release as well.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Fangmin
>>>>>>>>> 
>>>>>>>>> On Tue, Sep 11, 2018 at 11:58 AM Andor Molnár <andor@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Jeelani,
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks for letting me know. I'm happy to remove it from the list to get
>>>>>>>>>> closer to a stable release. :)
>>>>>>>>>> 
>>>>>>>>>> What's the feature which can be disabled to avoid data inconsistency?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Andor
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 09/10/2018 11:33 PM, Mohamed Jeelani wrote:
>>>>>>>>>>> Thanks Andor for compiling this. Should we be ignoring ZOOKEEPER-2418 as
>>>>>>>>>>> well? This exists in 3.4 as well and the feature can be disabled. We are
>>>>>>>>>>> working on a longer term fix for it in 3.6.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> 
>>>>>>>>>>> Jeelani
>>>>>>>>>>> 
>>>>>>>>>>> On 9/10/18, 5:19 AM, "Andor Molnar" <andor@cloudera.com.INVALID> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Fine.
>>>>>>>>>>> 
>>>>>>>>>>> I'm happy to ignore 1549, 2846 and 2930. Still, we have the list of:
>>>>>>>>>>> 
>>>>>>>>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>>>>>>>>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>>>>>>>>> - ZOOKEEPER-2418 (txnlog diff sync can skip sending some transactions to
>>>>>>>>>>> followers)
>>>>>>>>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with
>>>>>>>>>>> leader and follower receiving external connection requests.)
>>>>>>>>>>> 
>>>>>>>>>>> SSL (ZK-236) is a feature which is essential for the 3.5 release, hence I
>>>>>>>>>>> wouldn't leave it out or postpone it to the next stable release. The PR has
>>>>>>>>>>> been out for a long time, so please get on with reviewing it.
>>>>>>>>>>> The rest are also long-outstanding issues which have been found in the 3.5
>>>>>>>>>>> branch.
>>>>>>>>>>> ZK-1818 is something which was found in 3.4 and fixed in 3.4, but has never
>>>>>>>>>>> been fixed in 3.5. Quite a serious issue if still present.
>>>>>>>>>>> 
>>>>>>>>>>> I think we should at least run some manual testing and see if we can
>>>>>>>>>>> repro any of these issues before going ahead with a stable release.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Andor
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Sep 7, 2018 at 3:24 AM, Michael Han <hanm@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I haven't gone through the entire list, but it looks like lots of the JIRA
>>>>>>>>>>>> issues listed in this thread, such as ZOOKEEPER-1549 and 2846, also affect
>>>>>>>>>>>> 3.4 releases. Should we scope these issues out?
>>>>>>>>>>>> 
>>>>>>>>>>>> I think historically the single outstanding blocking issue for a stable 3.5
>>>>>>>>>>>> release has been the reconfig feature and the security concerns around it
>>>>>>>>>>>> (somehow addressed in ZOOKEEPER-2014), and the alpha and beta releases were
>>>>>>>>>>>> created to stabilize that feature.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__zookeeper-2Duser.578899.n2.nabble.com_Zookeeper-2Dwith-2D&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=_tGtL3nMWtuPrXKXDx27AIWOzyyT7W-CjIVLDFZwT0E&e=
>>>>>>>>>>>> SSL-release-date-tt7581744.html
>>>>>>>>>>>> 
>>>>>>>>>>>> So it looks like we are in good shape to release. Some things that might be
>>>>>>>>>>>> worth doing to claim the quality of 3.5 is on par with 3.4:
>>>>>>>>>>>> 
>>>>>>>>>>>> * Run Jepsen on 3.5 - 3.4 passed the test, for the record:
>>>>>>>>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__aphyr.com_posts_291-2Djepsen-2Dzookeeper&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=VjORkX5s7hrJyl8mW9Q4cfeSWF4qfTdyRjcuAiBt0y4&e=
>>>>>>>>>>>> * Fix all flaky tests on 3.5 - 3.4 has few or no flaky tests at all.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Sep 4, 2018 at 1:48 AM, Andor Molnar <andor@cloudera.com.invalid>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks Maoling! That would be a huge help, I appreciate it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Andor
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>> 
>>>> 
> 

