zookeeper-dev mailing list archives

From Michael Han <h...@apache.org>
Subject Re: ZooKeeper 3.4 to 3.5.x upgrade: "No snapshot found, but there are log entries. Something is broken!"
Date Tue, 05 Jun 2018 04:36:55 GMT
Hi Sijie,

>> I am just curious why the change was made in such a way.

It's a safety guarantee. Consider this case:

* An ensemble of servers A, B, and C. A and B have the most up-to-date
transactions (let's say zxid + 1) while C is lagging one transaction behind
(C has zxid). A is the current leader.
* A is partitioned away. At the same time, for some reason B loses its
snapshot file (for example, an admin runs 'rm -rf' on the entire dataDir by
mistake).
* Now with B and C, if we don't do the check, B will be elected as leader
because it has the most up-to-date transaction (zxid + 1). The state of the
ensemble will be set to B's state, which is incorrect: although B has the
most up-to-date transactions, it lost the older state along with the missing
snapshot file.
* In this case, we'd rather have the system stop working, by disallowing B
from participating in leader election, than have a working system with
incorrect state.

Note that the only case where we allow an empty snapshot file is when B is
bootstrapped as a new server joining the quorum.
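
To make the check concrete, here is a minimal sketch of the startup decision
described above. It is not the actual ZooKeeper code: the class name, the
`restore` method, its parameters, and the `NONE` sentinel are made up for
illustration, and only the exception message is taken from the stack trace
quoted below.

```java
import java.io.IOException;

public class SnapshotCheckSketch {
    /** Hypothetical sentinel meaning "nothing found on disk". */
    static final long NONE = -1L;

    /**
     * Decides whether a server may load its on-disk state.
     *
     * @param lastSnapshotZxid zxid of the newest snapshot, or NONE if there is no snapshot
     * @param lastLoggedZxid   zxid of the newest txn log entry, or NONE if the log is empty
     * @return the zxid the server would start from
     * @throws IOException if log entries exist without any snapshot
     */
    static long restore(long lastSnapshotZxid, long lastLoggedZxid) throws IOException {
        if (lastSnapshotZxid == NONE) {
            if (lastLoggedZxid != NONE) {
                // Log entries without a snapshot: older state may be lost,
                // so refuse to start rather than serve incorrect data.
                throw new IOException(
                    "No snapshot found, but there are log entries. Something is broken!");
            }
            // No snapshot and no log: a brand new server bootstrapping into the quorum.
            return 0L;
        }
        // Normal case: a snapshot exists, and the log may be ahead of it.
        return Math.max(lastSnapshotZxid, lastLoggedZxid);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(restore(NONE, NONE));   // new server
        System.out.println(restore(100L, 101L));   // snapshot plus newer log entries
        try {
            restore(NONE, 101L);                   // server B in the scenario above
        } catch (IOException e) {
            System.out.println("refused to start: " + e.getMessage());
        }
    }
}
```

In the scenario above, B corresponds to the third call: it still has log
entries up to zxid + 1 but no snapshot, so it fails at startup and never gets
to take part in leader election.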

>> Also, can you advise on the steps for people using 3.4.x to upgrade to
3.5.4-beta?

The only catch I remember is that if you are on a version older than 3.4.6,
you need to upgrade to 3.4.6 first before upgrading to 3.5.x, assuming you
are doing a rolling upgrade and want to keep the quorum live. See more at:
https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#ch_reconfig_upgrade


On Mon, Jun 4, 2018 at 5:40 PM, Sijie Guo <guosijie@gmail.com> wrote:

> Hi ZooKeeper team,
>
>
> We hit an issue when upgrading from 3.4.x to 3.5.4-beta and need some
> help and advice from the community.
>
> ```
> 10:14:55.607 [main] INFO  org.apache.zookeeper.server.NIOServerCnxnFactory - binding to port 0.0.0.0/0.0.0.0:2181
> 10:14:55.623 [main] ERROR org.apache.pulsar.zookeeper.LocalBookkeeperEnsemble - Exception while instantiating ZooKeeper
> java.io.IOException: No snapshot found, but there are log entries. Something is broken!
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:206) ~[pulsar-broker.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240) ~[pulsar-broker.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:284) ~[pulsar-broker.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:444) ~[pulsar-broker.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:764) ~[pulsar-broker.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.zookeeper.server.ServerCnxnFactory.startup(ServerCnxnFactory.java:98) ~[pulsar-broker.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.pulsar.zookeeper.LocalBookkeeperEnsemble.runZookeeper(LocalBookkeeperEnsemble.java:126) [pulsar-zookeeper-utils.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.pulsar.zookeeper.LocalBookkeeperEnsemble.startStandalone(LocalBookkeeperEnsemble.java:242) [pulsar-zookeeper-utils.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.pulsar.PulsarStandaloneStarter.start(PulsarStandaloneStarter.java:171) [pulsar-broker.jar:2.1.0-incubating-SNAPSHOT]
>     at org.apache.pulsar.PulsarStandaloneStarter.main(PulsarStandaloneStarter.java:266) [pulsar-broker.jar:2.1.0-incubating-SNAPSHOT]
> ```
>
>
> Looking into the source code,
> https://github.com/apache/zookeeper/blob/release-3.5.4/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java#L206
>
> A fix was introduced in https://issues.apache.org/jira/browse/ZOOKEEPER-2325
> to throw an exception when there are no snapshots and the txn log is not empty.
>
> I am just curious why the change was made in such a way. My feeling is that
> in a snapshotting-based store, if there are no snapshots but there are log
> entries, it usually doesn't mean the state is corrupted. I guess I might be
> missing some context behind ZOOKEEPER-2325.
>
>
> Also, can you advise on the steps for people using 3.4.x to upgrade to
> 3.5.4-beta?
>
> Thanks,
> Sijie
>
