bookkeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Enrico Olivelli <eolive...@gmail.com>
Subject Re: Help with bad errors on 4.6.1
Date Thu, 08 Mar 2018 20:51:31 GMT
Il gio 8 mar 2018, 21:33 Andrey Yegorov <andrey.yegorov@gmail.com> ha
scritto:

> I am looking more at the PendigAddOp and it looks like, in addition to the
> case that Sijie has fixed, there is another scenario where recycler can get
> triggered.
>

I think that this is another issue, but your idea makes sense to me.

This problem is on bookie side.

>
> I.e.:
> first sendWriteRequest() in PendingAddOp.safeRun() fails in PCBC's
> writeAndFlush (channel == null or channel.writeAndFlush throws)
> hasRun is set to true at that point. pendingWriteRequests is still zero
> (not incremented yet)
> PCBC's errorOut triggers callback that is sure that since
> pendingWriteRequests==0 and hasRun it can release PendingAddOp back to
> recycler
>
> now there are still sendWriteRequest's to follow and all bets are off.
> i.e. by the time they are executed the addOp can get reused so they end up
> sending another request to the wrong bookie.
>
> So far my idea of solution for that is to run PCBC's errorOut
> asynchronously / submit it back to orderedSafeExecutor so addOp can
> increment counts properly.
>
>
>
> ----------
> Andrey Yegorov
>
> On Thu, Mar 8, 2018 at 11:52 AM, Sijie Guo <guosijie@gmail.com> wrote:
>
>>
>>
>> On Thu, Mar 8, 2018 at 7:42 AM, Enrico Olivelli <eolivelli@gmail.com>
>> wrote:
>>
>>>
>>>
>>> 2018-03-08 14:50 GMT+01:00 Ivan Kelly <ivank@apache.org>:
>>>
>>>> It just occurred to me that this could be a problem with the recycler.
>>>> If we recycle a buffer too early, but then keep using it, another user
>>>> could pick it up, and between them they could corrupt the data that
>>>> would cause it to be unhandled.
>>>>
>>>
>>>
>>> Maybe current work from Sijie, Andrey and JV which Sijie is backporting
>>> to branch-4.6 could resolve the issue
>>>
>>
>>
>> I think that is unrelated. The change Andrey, JV and me dealt with is the
>> PendingAddOp. It is on client side not bookie side.
>>
>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>> -Ivan
>>>>
>>>> On Thu, Mar 8, 2018 at 1:49 PM, Enrico Olivelli <eolivelli@gmail.com>
>>>> wrote:
>>>> >
>>>> >
>>>> > 2018-03-08 11:57 GMT+01:00 Ivan Kelly <ivank@apache.org>:
>>>> >>
>>>> >> Hmm, yes, looks messed up. Do you have a reliable repro of this?
>>>> >>
>>>> >> "at
>>>> io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMessage"
>>>> >> <- it seems the message itself wasn't handled. Do you have any
more
>>>> >> bookie side logs leading up to this event?
>>>> >
>>>> >
>>>> >
>>>> > The only suspicious logs are for logger named
>>>> > "org.apache.bookkeeper.bookie.EntryLogger"
>>>> >
>>>> >
>>>> > 18-03-08-09-22-40       Created new entry log file
>>>> > /data/qasql/bookie/./bookie_data/current/4b0.log for logId 1200.
>>>> > 18-03-08-09-32-41       Failed to get ledgers map index from: 53.log
>>>> : No
>>>> > ledgers map index found on entryLogId53
>>>> > 18-03-08-09-33-10       Failed to get ledgers map index from: 54.log
>>>> : No
>>>> > ledgers map index found on entryLogId54
>>>> > 18-03-08-09-33-10       Failed to get ledgers map index from: 375.log
>>>> : No
>>>> > ledgers map index found on entryLogId375
>>>> > 18-03-08-09-33-13       Failed to get ledgers map index from: 878.log
>>>> : No
>>>> > ledgers map index found on entryLogId878
>>>> > 18-03-08-09-33-22       Failed to get ledgers map index from:
>>>> 1197.log : No
>>>> > ledgers map index found on entryLogId1197
>>>> > 18-03-08-09-33-57       Failed to get ledgers map index from:
>>>> 1198.log : No
>>>> > ledgers map index found on entryLogId1198
>>>> > 18-03-08-09-34-54       Failed to get ledgers map index from:
>>>> 1199.log : No
>>>> > ledgers map index found on entryLogId1199
>>>>
>>>
>>
>> This is also unrelated. It is a fine logging, since it means it attempts
>> to read ledgers map index, but didn't find one.
>>
>> the ledgers map index is only written when it is enabled at the bookies.
>>
>>
>>> >
>>>> > I will try to run Bookie with level debug, but it produces lots of
>>>> spam
>>>> >
>>>> > Thanks
>>>> > Enrico
>>>> >
>>>> >>
>>>> >>
>>>> >> -Ivan
>>>> >>
>>>> >> On Thu, Mar 8, 2018 at 9:18 AM, Enrico Olivelli <eolivelli@gmail.com
>>>> >
>>>> >> wrote:
>>>> >> > Hi all,
>>>> >> > I am seeing this bad errors in some test environments with
4.6.1
>>>> >> > The errors appear during rolling restart of the application,
the
>>>> test
>>>> >> > env is
>>>> >> > made of 6 machines:
>>>> >> > - 3 bookies
>>>> >> > - 3 client machines (with several BK clients, of different
>>>> sub-systems)
>>>> >> > - running with 4.6.1 both client and servers
>>>> >> >
>>>> >> > *I do not have* reports of this errors from production, already
>>>> running
>>>> >> > 4.6.1 for the last month
>>>> >> >
>>>> >> > But the problem is quite scary
>>>> >> >
>>>> >> > This is a sample of relevant errors on clients (in this case
>>>> Majordodo
>>>> >> > brokers, with log level = INFO)
>>>> >> >
>>>> >> > logs/org.apache.bookkeeper.client.PendingAddOp:
>>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>>> L366634
>>>> >> > E2557
>>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>>> L366634
>>>> >> > E2558
>>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>>> L366634
>>>> >> > E2559
>>>> >> > 18-03-07-15-59-55       Failed to write entry (366680, 1865):
>>>> Bookie
>>>> >> > operation timeout
>>>> >> > 18-03-07-15-59-55       Failed to write entry (366680, 1865):
>>>> Bookie
>>>> >> > operation timeout
>>>> >> > 18-03-07-16-00-00       Failed to write entry (366680, 1865):
>>>> Bookie
>>>> >> > operation timeout
>>>> >> > 18-03-07-16-00-00       Failed to write entry (366680, 1865):
>>>> Bookie
>>>> >> > operation timeout
>>>> >> > 18-03-07-16-00-05       Failed to write entry (366680, 1865):
>>>> Bookie
>>>> >> > operation timeout
>>>> >> >
>>>> >> >
>>>> >> > org.apache.bookkeeper.proto.PerChannelBookieClient:
>>>> >> >
>>>> >> > 18-03-07-10-06-44       Unexpected exception caught by bookie
>>>> client
>>>> >> > channel
>>>> >> > handler
>>>> >> > 18-03-07-10-06-44       io.netty.handler.codec.DecoderException:
>>>> >> > java.lang.IllegalStateException: Received unknown response
: op
>>>> code = 6
>>>> >> > io.netty.handler.codec.DecoderException:
>>>> >> > java.lang.IllegalStateException:
>>>> >> > Received unknown response : op code = 6
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:98)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:404)
>>>> >> >         at
>>>> >> > io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
>>>> >> >         at java.base/java.lang.Thread.run(Thread.java:844)
>>>> >> > Caused by: java.lang.IllegalStateException: Received unknown
>>>> response :
>>>> >> > op
>>>> >> > code = 6
>>>> >> >         at
>>>> >> >
>>>> >> > org.apache.bookkeeper.proto.Bo
>>>> okieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)
>>>> >> >         at
>>>> >> >
>>>> >> > org.apache.bookkeeper.proto.Bo
>>>> okieProtoEncoding$ResponseDecoder.decode(BookieProtoEncoding.java:478)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
>>>> >> >
>>>> >> >
>>>> >> >         ... 19 more
>>>> >> >
>>>> >> >
>>>> >> > Errors on Bookie side, with log level = INFO)
>>>> >> >
>>>> >> >
>>>> >> > io.netty.channel.DefaultChannelPipeline:
>>>> >> > 8-03-07-09-51-38       An exceptionCaught() event was fired,
and it
>>>> >> > reached
>>>> >> > at the tail of the pipeline. It usually means the last handler
in
>>>> the
>>>> >> > pipeline did not handle the exception.
>>>> >> > 18-03-07-09-51-38
>>>>  io.netty.util.IllegalReferenceCountException:
>>>> >> > refCnt: 0, increment: 1
>>>> >> > io.netty.util.IllegalReferenceCountException: refCnt: 0,
>>>> increment: 1
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:100)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
>>>> >> >         at
>>>> >> >
>>>> io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.java:88)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMessage(DefaultChannelPipeline.java:1202)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.DefaultChannelPipeline$TailContext.channelRead(DefaultChannelPipeline.java:1299)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>> >> >         at
>>>> >> >
>>>> >> > org.apache.bookkeeper.proto.Bo
>>>> okieRequestHandler.channelRead(BookieRequestHandler.java:78)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
>>>> >> >         at
>>>> >> >
>>>> >> > org.apache.bookkeeper.proto.Au
>>>> thHandler$ServerSideHandler.channelRead(AuthHandler.java:88)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:404)
>>>> >> >         at
>>>> >> > io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
>>>> >> >         at java.base/java.lang.Thread.run(Thread.java:844)
>>>>
>>>
>> questions:
>>
>> 1) which protocol version are you using?
>>
>> 2) is the exception coming from read response or write response? it sound
>> like write response. If it is write response, it would be surprised about
>> the ref count. because we don't send any buffer back in write response.
>>
>>
>>> >> >
>>>> >> >
>>>> >
>>>> >
>>>>
>>>
>>>
>>
> --


-- Enrico Olivelli

Mime
View raw message