zookeeper-user mailing list archives

From Christian Ziech <christian.zi...@nokia.com>
Subject Re: Data loss for actions happening after a truncate in 3.4.3
Date Mon, 18 Jun 2012 09:37:53 GMT
Created https://issues.apache.org/jira/browse/ZOOKEEPER-1489 - please
let me know if the test does not fail as intended (it did for me) or
otherwise does not show the problem correctly (or if you need anything else).

On 15.06.2012 19:51, ext Patrick Hunt wrote:
> Please do create a jira for this. If you have a reproducible test
> case, or even just steps to reproduce that will be useful. Sounds like
> something we'll need to get into 3.4.4 at the very least.
>
> Thanks for the report!
>
> Patrick
>
> On Fri, Jun 15, 2012 at 9:38 AM, Christian Ziech
> <christian.ziech@nokia.com>  wrote:
>> This issue seems to affect only ZooKeeper 3.4.3 (and not 3.3.5). Basically
>> it seems that after the truncate method is invoked, the logStream member of
>> FileTxnLog still points to the old position in the file, where it would
>> have written the next entry before the truncate happened. Since the log
>> file is not rolled over and the stream is not reset, the next write leaves
>> a gap in the file, and that gap is interpreted as the end of the file when
>> the log is read.
>>
>> That means once this node becomes leader later on, it would send a snapshot
>> to all its peers that contains only the entries up to the truncation; all
>> entries thereafter would not be sent. We had this happen on 2 of the 3
>> ZooKeeper servers of a test cluster while the network connection was bad.
>> Even after the nodes recovered, we would lose all the data every time the
>> leadership switched to one of those two nodes.
>>
>> Furthermore (and this is something I could not 100% reproduce yet) it seems
>> that in some situations, after some leader changes, the transaction log file
>> would not only contain a gap but would simply stop after the last entry
>> before the truncation.
>>
>> I have a small program that reliably reproduces the error for 3.4.3
>> but not for 3.3.5. That seems to be because the new leader in 3.3.5 does not
>> send the truncation message to a peer that is more advanced than the
>> new leader, but the actual problem also seems to be present in 3.3.5 (I just
>> couldn't get the TRUNC message to be sent in my test).
>>
>> Have other people encountered the same issue already?
>>
>> I will create a ticket with the test that reproduces the issue later, but
>> first I need to spend some more time on that script. Things are a
>> little hard to reproduce because I have to pull a ZooKeeper server out of
>> the ensemble for some time without restarting it. To do so I'm using
>> port forwarding instead of direct connections, which I can interrupt even
>> on localhost.
>>
>> What more information do you guys need to investigate the issue?
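
For illustration, the gap described in the quoted report can be sketched outside of ZooKeeper with plain java.io. This is not the FileTxnLog code: the class and method names, the fixed 8-byte records and the separate RandomAccessFile used for truncation are all assumptions made for the sketch. It only shows that a FileOutputStream keeps its old file position after the file is shortened through another handle, so the next write lands beyond the new end of file. Repositioning the stream is shown as one possible remedy, not necessarily the fix chosen in ZOOKEEPER-1489.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class TruncateGapDemo {
    // Hypothetical fixed record size; real transaction log entries vary in length.
    static final int RECORD_SIZE = 8;

    // Builds a record whose bytes all carry the same marker value.
    static byte[] record(byte marker) {
        byte[] r = new byte[RECORD_SIZE];
        Arrays.fill(r, marker);
        return r;
    }

    // Writes three records, truncates the file to one record through a
    // second handle, then writes one more record through the original
    // stream. Returns the resulting file contents.
    static byte[] writeTruncateWrite(Path file, boolean repositionAfterTruncate)
            throws IOException {
        try (FileOutputStream logStream = new FileOutputStream(file.toFile())) {
            for (int i = 1; i <= 3; i++) {
                logStream.write(record((byte) i));
            }
            logStream.flush();

            // Truncate behind the stream's back (assumption: a separate handle).
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
                raf.setLength(RECORD_SIZE);
            }

            if (repositionAfterTruncate) {
                // Move the stream's position back to the new end of file.
                logStream.getChannel().position(RECORD_SIZE);
            }

            // Without repositioning, this write starts at the stale offset
            // (3 * RECORD_SIZE) and leaves a hole after the first record.
            logStream.write(record((byte) 9));
        }
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        Path stale = Files.createTempFile("txnlog-stale", ".bin");
        Path fixed = Files.createTempFile("txnlog-fixed", ".bin");
        System.out.println("stale stream: " + writeTruncateWrite(stale, false).length + " bytes");
        System.out.println("repositioned: " + writeTruncateWrite(fixed, true).length + " bytes");
    }
}
```

With the stale stream the new record ends up after a hole that begins right behind the first record; a reader that treats the hole as end of file would never see the new record. With the repositioned stream the new record directly follows the first one.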


-- 
*NOKIA*
*Christian Ziech*
Senior Software Developer
Context Based Services
Services & Software
Mobile: +4915155155740
Fax: +493044676555
eMail: christian.ziech@nokia.com
Nokia gate5 GmbH
Invalidenstr. 117
10115 Berlin, Germany
www.maps.nokia.com <http://www.maps.nokia.com>
www.smart2go.com <http://www.smart2go.com>

Nokia gate5 GmbH, Sitz der Gesellschaft: Berlin, Amtsgericht 
Charlottenburg: HRB 106443 B, Steuernr.: 37/222/20817, ID/VAT-Nr.: DE 
812 845 193, Geschäftsführer: Dr. Michael Halbherr, Karim Tähtivuori
