bookkeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sijie Guo <si...@apache.org>
Subject Re: Bookkeeper Recovery Issues
Date Fri, 25 Nov 2016 15:51:33 GMT
You mentioned that you are running autorecovery at separated nodes, right?
I think I might know what's the problem. Let me double  check and get back
to you tonight.

Sijie

On Nov 25, 2016 6:19 AM, "Sebastián Schepens" <
sebastian.schepens@mercadolibre.com> wrote:

Guys,
Still seeing weird behavior.
Trying autorecovery on separate nodes, they've been running for a long time
just spitting error logs, but apparently moving data, but they never seem
to end and eventually throw OutOfDirectMemory errors (we have assigned 20G
direct memory to them...).

I now have started autorecovery processes of version 4.4 (rest of the
cluster is 4.3), they throw the following errors and do not seem to be
moving data, all node's IO metrics are flat:

2016-11-25 13:55:36,775 - ERROR - [New I/O boss
#9:PerChannelBookieClient$1@255]
- Could not connect to bookie: [id: 0x3dc4f6b8]/10.3.3.41:3181, current
state CONNECTING :
java.net.ConnectException: Connection refused: /10.3.3.41:3181
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(
SocketChannelImpl.java:717)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
connect(NioClientBoss.java:152)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
processSelectedKeys(NioClientBoss.java:105)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
process(NioClientBoss.java:79)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(
AbstractNioSelector.java:318)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.run(
NioClientBoss.java:42)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(
ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(
DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

2016-11-25 13:46:46,458 - ERROR - [BookKeeperClientWorker-
orderedsafeexecutor-1-0:BookkeeperInternalCallbacks$MultiCallback@125] -
Error in multi callback : -8
2016-11-25 13:46:46,458 - ERROR - [BookKeeperClientWorker-
orderedsafeexecutor-1-0:LedgerFragmentReplicator$2$1@274] - BK error
writing entry for ledgerId: 9449, entryId: 2536703, b
ookie: 10.3.3.41:3181
org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException
        at org.apache.bookkeeper.client.BKException.create(
BKException.java:62)
        at org.apache.bookkeeper.client.LedgerFragmentReplicator$2$1.
writeComplete(LedgerFragmentReplicator.java:279)
        at org.apache.bookkeeper.proto.BookieClient$1$1.safeRun(
BookieClient.java:189)
        at org.apache.bookkeeper.util.SafeRunnable.run(SafeRunnable.java:31)
        at java.util.concurrent.Executors$RunnableAdapter.
call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The strange thing is that 10.3.3.41 is the autorecovery node, it seems to
be trying to replicate data to itself, is this ok?

It's also attempting to connect to the failed node:

2016-11-25 13:55:36,777 - ERROR - [New I/O boss
#9:PerChannelBookieClient$1@255]
- Could not connect to bookie: [id: 0xb1b27b5b]/10.3.2.158:3181, current
state CONNECTING :
java.net.ConnectException: Connection refused: /10.3.2.158:3181
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(
SocketChannelImpl.java:717)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
connect(NioClientBoss.java:152)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
processSelectedKeys(NioClientBoss.java:105)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
process(NioClientBoss.java:79)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(
AbstractNioSelector.java:318)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.run(
NioClientBoss.java:42)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(
ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(
DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

If I execute a manual recovery I get the same errors:
2016-11-25 14:16:56,438 - ERROR - [New I/O boss
#9:PerChannelBookieClient$1@255]
- Could not connect to bookie: [id: 0x3e258d5b]/10.3.2.158:3181, current
state CONNECTING :
java.net.ConnectException: Connection refused: /10.3.2.158:3181
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(
SocketChannelImpl.java:717)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
connect(NioClientBoss.java:152)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
processSelectedKeys(NioClientBoss.java:105)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.
process(NioClientBoss.java:79)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(
AbstractNioSelector.java:318)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.run(
NioClientBoss.java:42)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(
ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(
DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

But it seems to be moving data now.

Could you guys give me some insight?
Are these logs normal? If so, why are they level error? Why is the
autorecovery process trying to connect to itself and not replicating data
to other nodes? Is there a way to know the progress of the recovery
process? Is it normal for the recovery processes to run out of direct
memory?

Thanks,
Sebastian


On Wed, Nov 23, 2016 at 11:33 AM Sebastián Schepens <sebastian.schepens@
mercadolibre.com> wrote:

> Sijie,
> Yes, that's precisely what I meant, we're running separate autorecovery
> processes, not daemons on all nodes.
>
> Autorecovery processes run quietly until I stop a node, as soon as I stop
> a node, they're plagued with logs like the following, where the stopped
> node is (10.3.2.56):
>
> 2016-11-22 17:34:06,085 - ERROR - [bookkeeper-io-1-1:
> PerChannelBookieClient$2@284] - Could not connect to bookie: [id:
> 0xd3b0c759, L:/10.3.3.42:45164]/10.3.2.56:3181, current state CONNECTING
> :
> java.net.ConnectException: syscall:getsockopt(...): /10.3.2.56:3181
>
> There seems to be waves of thousands and thousands of these logs while
> some data movement seems to be occurring, but it's really weird that it's
> constantly trying to connect to the failed node.
> Couldn't it realize it's down because it's not shown as available on
> zookeeper?
>
> We also see a couple of this logs, but really few of them compared to the
> previous.
>
> 2016-11-23 14:28:01,661 - WARN  - [ReplicationWorker:
> RackawareEnsemblePlacementPolicy@543] - Failed to choose a bookie:
> excluded [<Bookie:10.3.2.57:3181>, <Bookie:10.3.2.195:3181>
> , <Bookie:10.3.2.158:3181>], fallback to choose bookie randomly from the
> cluster.
>
>
> The cluster currently has 6 nodes, and as I said before we're using
> ensemble size 3, write quorum 3 and ack quorum 2.
>
> Thanks,
> Sebastian
>
> On Tue, Nov 22, 2016 at 2:10 PM Sijie Guo <sijie@apache.org> wrote:
>
> I think what Sebastian said is that manual recovery didn't even work. This
> seems to a bit strange to me. The autorecovery will check if the bookie is
> available or not. After that, it should rereplicate the data from other
> nodes in the ensemble. This seems to indicate something is broken.
> Sebastian, Can you point us some loggings?
>
> Sijie
>
> On Nov 19, 2016 9:46 AM, "Rithin Shetty" <rithin@gmail.com> wrote:
>
> A few things to note: Make sure 'autoRecoveryDaemonEnabled' set to true on
> all the bookie conf files; by default this is false. Otherwise recovery
> will not work. The auto recovery process tries to make sure that the data
> doesn't exist on the source node before replicating to destination. That
> might be the reason why it is trying to talk to the source node.
>
> --Rithin
>
> On Fri, Nov 18, 2016 at 12:00 PM, Sebastián Schepens <sebastian.schepens@
> mercadolibre.com> wrote:
>
> Hi guys,
> I'm running into some issues while trying to recover a decomissioned node.
> Both the recovery command and autorecovery processes fail trying to
> connect to the failing node, which seems reasonable because the node is
> down.
> But I don't get why it's trying to connect to that node, according to the
> documentation it should pull ledger data from other nodes in the ensemble
> (3) and replicate them.
> The recovery command also seems to completely ignore the destination node
> given as third argument.
>
> Could someone give us some help?
> Thanks,
> Sebastian
>
>
>

Mime
View raw message