zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andor Molnar <an...@apache.org>
Subject Re: ZooKeeper 3.5 blocker issues
Date Thu, 13 Sep 2018 08:57:56 GMT
What’s needed to turn it off?
Do we need a PR or it’s just a config option?
Shall we implement a feature switch for that and turn it off by default?

Sorry I don’t have too much insight on disk txn sync.

Andor



> On 2018. Sep 13., at 9:16, Fangmin Lv <lvfangmin@gmail.com> wrote:
> 
> And to be clear, ZOOKEEPER-2418 is actually just one case of inconsistency
> which could caused by on disk txn sync, as I mentioned in a newer JIRA
> ZOOKEEPER-2846 <https://issues.apache.org/jira/browse/ZOOKEEPER-2846>, the
> snap sync or txn sync could also leave txns gap in the txn file, which is a
> more common case could trigger this issue.
> 
> I would suggest to turn off the on disk txn sync by default for now to
> avoid this issue, after we finished ZOOKEEPER-3114, we can use that to
> validate the on disk txns during syncing.
> 
> Thanks,
> Fangmin
> 
> On Wed, Sep 12, 2018 at 9:55 AM Fangmin Lv <lvfangmin@gmail.com> wrote:
> 
>> Andor,
>> 
>> ZOOKEEPER-3114 is about adding real time digest checking to help detecting
>> inconsistency, it's a new feature with amounts of code change. I'll start
>> upstream it part by part, but I don't expect it's being merged in the next
>> few weeks. So yes, it's a nice to have, but definitely not a block for 3.5.
>> 
>> Thanks,
>> Fangmin
>> 
>> On Wed, Sep 12, 2018 at 2:55 AM Andor Molnar <andor@apache.org> wrote:
>> 
>>> Fangmin,
>>> 
>>> Sorry, I just noticed that you want to include the consistency fixes in
>>> the stable version which is fine. Let’s finish the backports and we’ll be
>>> done with them.
>>> 
>>> ZOOKEEPER-3114 is essentially a new feature, I wouldn’t block 3.5 with
>>> that. What do you think?
>>> 
>>> Andor
>>> 
>>> 
>>> 
>>>> On 2018. Sep 12., at 11:52, Andor Molnar <andor@apache.org> wrote:
>>>> 
>>>> Cool, thanks for the clarification.
>>>> 
>>>> The updated list is as follows:
>>>> 
>>>> - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>>>> - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>> - ZOOKEEPER-2778 (Potential server deadlock between follower sync with
>>> leader and follower receiving external connection requests.)
>>>> 
>>>> The following are not critical and no blockers for the stable release:
>>>> 
>>>> Waiting for to be ported to 3.5:
>>>> - ZOOKEEPER-3104
>>>> - ZOOKEEPER-3125
>>>> - ZOOKEEPER-3127
>>>> 
>>>> New feature:
>>>> - ZOOKEEPER-3114 (fixes ZOOKEEPER-2184 too)
>>>> 
>>>> Regards,
>>>> Andor
>>>> 
>>>> 
>>>> 
>>>>> On 2018. Sep 12., at 0:42, Fangmin Lv <lvfangmin@gmail.com> wrote:
>>>>> 
>>>>> Hi Andor,
>>>>> 
>>>>> That's the on disk txn feature, which was disabled internally after we
>>>>> found the potentially inconsistent issue. The only solution we have
>>> for now
>>>>> is waiting for the new digest checking feature I mentioned in
>>>>> ZOOKEEPER-3114.
>>>>> 
>>>>> I think there are some other critical consistent issues we just fixed
>>> on
>>>>> master recently: ZOOKEEPER-3104, ZOOKEEPER-3125, ZOOKEEPER-3127, I
>>> think we
>>>>> should include that in the official 3.5 release as well.
>>>>> 
>>>>> Thanks,
>>>>> Fangmin
>>>>> 
>>>>> On Tue, Sep 11, 2018 at 11:58 AM Andor Molnár <andor@apache.org>
>>> wrote:
>>>>> 
>>>>>> Hi Jeelani,
>>>>>> 
>>>>>> 
>>>>>> Thanks for letting me know. I'm happy to remove it from the list
to
>>> get
>>>>>> closer to a stable release. :)
>>>>>> 
>>>>>> What's the feature which can be disabled to avoid data inconsistency?
>>>>>> 
>>>>>> 
>>>>>> Andor
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 09/10/2018 11:33 PM, Mohamed Jeelani wrote:
>>>>>>> Thanks Andor for compiling this. Should we be ignoring
>>> ZOOKEEPER-2418 as
>>>>>> well? This exists in 3.4 as well and the feature can be disabled.
We
>>> are
>>>>>> working on a longer term fix for it in 3.6.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>> Jeelani
>>>>>>> 
>>>>>>> On 9/10/18, 5:19 AM, "Andor Molnar" <andor@cloudera.com.INVALID>
>>> wrote:
>>>>>>> 
>>>>>>>  Fine.
>>>>>>> 
>>>>>>>  I'm happy to ignore 1549, 2846 and 2930. Still we have the list
of:
>>>>>>> 
>>>>>>>  - ZOOKEEPER-236 (SSL/TLS support for Atomic Broadcast protocol)
>>>>>>>  - ZOOKEEPER-1818 (Fix don't care for trunk)
>>>>>>>  - ZOOKEEPER-2418 (txnlog diff sync can skip sending some
>>>>>> transactions to
>>>>>>>  followers)
>>>>>>>  - ZOOKEEPER-2778 (Potential server deadlock between follower
sync
>>>>>> with
>>>>>>>  leader and follower receiving external connection requests.)
>>>>>>> 
>>>>>>>  SSL (ZK-236) is a feature which essential for the 3.5 release,
>>> hence
>>>>>> I
>>>>>>>  wouldn't leave it out or postpone it for the next stable release.
>>> PR
>>>>>> has
>>>>>>>  been out for a long time, get on reviewing please.
>>>>>>>  The rest are also long outstanding issues which have been found
in
>>>>>> the 3.5
>>>>>>>  branch.
>>>>>>>  ZK-1818 is something which was found in 3.4 and fixed in 3.4,
but
>>>>>> never has
>>>>>>>  been fixed in 3.5. Quite a serious issue if still present.
>>>>>>> 
>>>>>>>  I think we should at least run some manual testing and see if
we
>>>>>> could
>>>>>>>  repro any of these issues before going ahead with a stable release.
>>>>>>> 
>>>>>>>  Regards,
>>>>>>>  Andor
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>  On Fri, Sep 7, 2018 at 3:24 AM, Michael Han <hanm@apache.org>
>>> wrote:
>>>>>>> 
>>>>>>>> I haven't went through the entire list, but looks like lots
of the
>>>>>> JIRA
>>>>>>>> issues listed in this thread, such as ZOOKEEPER-1549, 2846,
also
>>>>>> affects
>>>>>>>> 3.4 releases. Should we scope these issues out?
>>>>>>>> 
>>>>>>>> I think historically the single outstanding blocking issue
for a
>>>>>> stable 3.5
>>>>>>>> release is the reconfig feature and security concerns around
it
>>>>>> (somehow
>>>>>>>> addressed in ZOOKEEPER-2014), and the alpha and beta releases
were
>>>>>> created
>>>>>>>> to stabilize that feature.
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__zookeeper-2Duser.578899.n2.nabble.com_Zookeeper-2Dwith-2D&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=_tGtL3nMWtuPrXKXDx27AIWOzyyT7W-CjIVLDFZwT0E&e=
>>>>>>>> SSL-release-date-tt7581744.html
>>>>>>>> 
>>>>>>>> So it looks like we are in good shape to release. Something
might
>>>>>> worth
>>>>>>>> doing to claim the quality of 3.5 is on par with 3.4
>>>>>>>> 
>>>>>>>> * Run Jepsen on 3.5 - 3.4 passed the test for the record
>>>>>>>> 
>>>>>> 
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__aphyr.com_posts_291-2Djepsen-2Dzookeeper&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=Vl4oKanLQehvaulUvoKg8A&m=wqlhnot9c-pQLdkGkccSGNpELUNUnB-wy_h0iA3PRqI&s=VjORkX5s7hrJyl8mW9Q4cfeSWF4qfTdyRjcuAiBt0y4&e=
>>>>>>>> * Fix all flaky tests on 3.5 - 3.4 has little or no flaky
tests at
>>>>>> all.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Sep 4, 2018 at 1:48 AM, Andor Molnar
>>>>>> <andor@cloudera.com.invalid>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Thanks Maoling! That would be huge help, I appreciate
it.
>>>>>>>>> 
>>>>>>>>> Andor
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 


Mime
View raw message