bookkeeper-user mailing list archives

From Sebastián Schepens <sebastian.schep...@mercadolibre.com>
Subject Re: Bookkeeper Recovery Issues
Date Wed, 23 Nov 2016 14:33:24 GMT
Sijie,
Yes, that's precisely what I meant: we're running separate autorecovery
processes, not daemons on all nodes.

The autorecovery processes run quietly until I stop a node. As soon as I
do, they are flooded with log entries like the following, where the stopped
node is 10.3.2.56:

2016-11-22 17:34:06,085 - ERROR -
[bookkeeper-io-1-1:PerChannelBookieClient$2@284] - Could not connect to
bookie: [id: 0xd3b0c759, L:/10.3.3.42:45164]/10.3.2.56:3181, current state
CONNECTING :
java.net.ConnectException: syscall:getsockopt(...): /10.3.2.56:3181

There seem to be waves of thousands and thousands of these logs while some
data movement appears to be occurring, but it's really odd that it keeps
trying to connect to the failed node.
Couldn't it realize the node is down because it's no longer listed as
available in ZooKeeper?
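
To put a number on those waves, something along these lines could tally
the failed connection attempts per target bookie. It is only a sketch: the
regular expression is derived from the single log line quoted above, and
the function name count_connect_failures is made up for illustration.

```python
import re
from collections import Counter

# Captures the remote address in lines such as
#   "Could not connect to bookie: [id: ..., L:/local:port]/remote:port, ..."
# \s+ tolerates the line wrapping seen in the excerpt above.
CONNECT_FAILURE = re.compile(
    r"Could not connect to\s+bookie:\s+\[[^\]]*\]/(\d+\.\d+\.\d+\.\d+:\d+)"
)


def count_connect_failures(log_text):
    """Return a Counter mapping bookie address -> failed connection attempts."""
    return Counter(CONNECT_FAILURE.findall(log_text))


# The log line quoted above, joined back into one line.
sample = (
    "2016-11-22 17:34:06,085 - ERROR - "
    "[bookkeeper-io-1-1:PerChannelBookieClient$2@284] - Could not connect to "
    "bookie: [id: 0xd3b0c759, L:/10.3.3.42:45164]/10.3.2.56:3181, current state "
    "CONNECTING :\n"
)

for address, attempts in count_connect_failures(sample).items():
    print(address, attempts)
```

Run against a full autorecovery log, this would show whether every failed
attempt targets the stopped node or whether other bookies are affected too.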

We also see a few of these logs, but far fewer than the previous ones:

2016-11-23 14:28:01,661 - WARN  -
[ReplicationWorker:RackawareEnsemblePlacementPolicy@543] - Failed to choose
a bookie: excluded [<Bookie:10.3.2.57:3181>, <Bookie:10.3.2.195:3181>
, <Bookie:10.3.2.158:3181>], fallback to choose bookie randomly from the
cluster.


The cluster currently has 6 nodes, and as I said before we're using
ensemble size 3, write quorum 3 and ack quorum 2.
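
For reference, here is a rough sketch of what those settings imply when one
bookie goes down. It assumes entries are striped round-robin across the
ensemble, which is my understanding of the default behaviour and not
something confirmed in this thread. The function names and the choice of
ensemble members are made up for illustration.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that store an entry, assuming round-robin striping (an assumption)."""
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]


def surviving_replicas(entry_id, ensemble, write_quorum, failed):
    """Bookies in the entry's write set that are not in the failed set."""
    return [b for b in write_set(entry_id, ensemble, write_quorum) if b not in failed]


# Hypothetical ensemble of 3 bookies; the addresses are illustrative only.
ensemble = ["10.3.2.56:3181", "10.3.2.57:3181", "10.3.2.195:3181"]
failed = {"10.3.2.56:3181"}  # the stopped node
WRITE_QUORUM = 3
ACK_QUORUM = 2

for entry_id in range(3):
    alive = surviving_replicas(entry_id, ensemble, WRITE_QUORUM, failed)
    # With write quorum == ensemble size, every entry loses exactly one replica.
    print(entry_id, alive, len(alive) >= ACK_QUORUM)
```

Under that assumption, every entry of a ledger whose ensemble included the
stopped node still has two copies on the other two bookies, so
re-replication should be able to read from those.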

Thanks,
Sebastian

On Tue, Nov 22, 2016 at 2:10 PM Sijie Guo <sijie@apache.org> wrote:

I think what Sebastian said is that manual recovery didn't even work. This
seems a bit strange to me. Autorecovery will check whether the bookie is
available or not. After that, it should re-replicate the data from the
other nodes in the ensemble. This seems to indicate something is broken.
Sebastian, can you point us to some logs?

Sijie

On Nov 19, 2016 9:46 AM, "Rithin Shetty" <rithin@gmail.com> wrote:

A few things to note: make sure 'autoRecoveryDaemonEnabled' is set to true
in all the bookie conf files; by default it is false. Otherwise recovery
will not work. The autorecovery process tries to make sure that the data
doesn't exist on the source node before replicating to the destination.
That might be the reason why it is trying to talk to the source node.

--Rithin

On Fri, Nov 18, 2016 at 12:00 PM, Sebastián Schepens
<sebastian.schepens@mercadolibre.com> wrote:

Hi guys,
I'm running into some issues while trying to recover a decommissioned node.
Both the recovery command and the autorecovery processes fail trying to
connect to the failing node, which seems reasonable because the node is
down. But I don't get why they're trying to connect to that node at all:
according to the documentation, recovery should pull ledger data from the
other nodes in the ensemble (3) and replicate it.
The recovery command also seems to completely ignore the destination node
given as the third argument.

Could someone give us some help?
Thanks,
Sebastian
