bookkeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrey Yegorov <andrey.yego...@gmail.com>
Subject Re: Help with bad errors on 4.6.1
Date Thu, 08 Mar 2018 20:32:57 GMT
I am looking more at the PendigAddOp and it looks like, in addition to the
case that Sijie has fixed, there is another scenario where recycler can get
triggered.

I.e.:
first sendWriteRequest() in PendingAddOp.safeRun() fails in PCBC's
writeAndFlush (channel == null or channel.writeAndFlush throws)
hasRun is set to true at that point. pendingWriteRequests is still zero
(not incremented yet)
PCBC's errorOut triggers callback that is sure that since
pendingWriteRequests==0 and hasRun it can release PendingAddOp back to
recycler

now there are still sendWriteRequest's to follow and all bets are off. i.e.
by the time they are executed the addOp can get reused so they end up
sending another request to the wrong bookie.

So far my idea of solution for that is to run PCBC's errorOut
asynchronously / submit it back to orderedSafeExecutor so addOp can
increment counts properly.



----------
Andrey Yegorov

On Thu, Mar 8, 2018 at 11:52 AM, Sijie Guo <guosijie@gmail.com> wrote:

>
>
> On Thu, Mar 8, 2018 at 7:42 AM, Enrico Olivelli <eolivelli@gmail.com>
> wrote:
>
>>
>>
>> 2018-03-08 14:50 GMT+01:00 Ivan Kelly <ivank@apache.org>:
>>
>>> It just occurred to me that this could be a problem with the recycler.
>>> If we recycle a buffer too early, but then keep using it, another user
>>> could pick it up, and between them they could corrupt the data that
>>> would cause it to be unhandled.
>>>
>>
>>
>> Maybe current work from Sijie, Andrey and JV which Sijie is backporting
>> to branch-4.6 could resolve the issue
>>
>
>
> I think that is unrelated. The change Andrey, JV and me dealt with is the
> PendingAddOp. It is on client side not bookie side.
>
>
>>
>>
>>
>>
>>
>>>
>>> -Ivan
>>>
>>> On Thu, Mar 8, 2018 at 1:49 PM, Enrico Olivelli <eolivelli@gmail.com>
>>> wrote:
>>> >
>>> >
>>> > 2018-03-08 11:57 GMT+01:00 Ivan Kelly <ivank@apache.org>:
>>> >>
>>> >> Hmm, yes, looks messed up. Do you have a reliable repro of this?
>>> >>
>>> >> "at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe
>>> ssage"
>>> >> <- it seems the message itself wasn't handled. Do you have any more
>>> >> bookie side logs leading up to this event?
>>> >
>>> >
>>> >
>>> > The only suspicious logs are for logger named
>>> > "org.apache.bookkeeper.bookie.EntryLogger"
>>> >
>>> >
>>> > 18-03-08-09-22-40       Created new entry log file
>>> > /data/qasql/bookie/./bookie_data/current/4b0.log for logId 1200.
>>> > 18-03-08-09-32-41       Failed to get ledgers map index from: 53.log :
>>> No
>>> > ledgers map index found on entryLogId53
>>> > 18-03-08-09-33-10       Failed to get ledgers map index from: 54.log :
>>> No
>>> > ledgers map index found on entryLogId54
>>> > 18-03-08-09-33-10       Failed to get ledgers map index from: 375.log
>>> : No
>>> > ledgers map index found on entryLogId375
>>> > 18-03-08-09-33-13       Failed to get ledgers map index from: 878.log
>>> : No
>>> > ledgers map index found on entryLogId878
>>> > 18-03-08-09-33-22       Failed to get ledgers map index from: 1197.log
>>> : No
>>> > ledgers map index found on entryLogId1197
>>> > 18-03-08-09-33-57       Failed to get ledgers map index from: 1198.log
>>> : No
>>> > ledgers map index found on entryLogId1198
>>> > 18-03-08-09-34-54       Failed to get ledgers map index from: 1199.log
>>> : No
>>> > ledgers map index found on entryLogId1199
>>>
>>
>
> This is also unrelated. It is a fine logging, since it means it attempts
> to read ledgers map index, but didn't find one.
>
> the ledgers map index is only written when it is enabled at the bookies.
>
>
>> >
>>> > I will try to run Bookie with level debug, but it produces lots of spam
>>> >
>>> > Thanks
>>> > Enrico
>>> >
>>> >>
>>> >>
>>> >> -Ivan
>>> >>
>>> >> On Thu, Mar 8, 2018 at 9:18 AM, Enrico Olivelli <eolivelli@gmail.com>
>>> >> wrote:
>>> >> > Hi all,
>>> >> > I am seeing this bad errors in some test environments with 4.6.1
>>> >> > The errors appear during rolling restart of the application, the
>>> test
>>> >> > env is
>>> >> > made of 6 machines:
>>> >> > - 3 bookies
>>> >> > - 3 client machines (with several BK clients, of different
>>> sub-systems)
>>> >> > - running with 4.6.1 both client and servers
>>> >> >
>>> >> > *I do not have* reports of this errors from production, already
>>> running
>>> >> > 4.6.1 for the last month
>>> >> >
>>> >> > But the problem is quite scary
>>> >> >
>>> >> > This is a sample of relevant errors on clients (in this case
>>> Majordodo
>>> >> > brokers, with log level = INFO)
>>> >> >
>>> >> > logs/org.apache.bookkeeper.client.PendingAddOp:
>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>> L366634
>>> >> > E2557
>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>> L366634
>>> >> > E2558
>>> >> > 18-03-07-09-51-43       Write of ledger entry to quorum failed:
>>> L366634
>>> >> > E2559
>>> >> > 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>>> >> > operation timeout
>>> >> > 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>>> >> > operation timeout
>>> >> > 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>>> >> > operation timeout
>>> >> > 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>>> >> > operation timeout
>>> >> > 18-03-07-16-00-05       Failed to write entry (366680, 1865): Bookie
>>> >> > operation timeout
>>> >> >
>>> >> >
>>> >> > org.apache.bookkeeper.proto.PerChannelBookieClient:
>>> >> >
>>> >> > 18-03-07-10-06-44       Unexpected exception caught by bookie client
>>> >> > channel
>>> >> > handler
>>> >> > 18-03-07-10-06-44       io.netty.handler.codec.DecoderException:
>>> >> > java.lang.IllegalStateException: Received unknown response : op
>>> code = 6
>>> >> > io.netty.handler.codec.DecoderException:
>>> >> > java.lang.IllegalStateException:
>>> >> > Received unknown response : op code = 6
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>> essageToMessageDecoder.java:98)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>> ByteToMessageDecoder.java:310)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>> ByteToMessageDecoder.java:297)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>>> oMessageDecoder.java:413)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>> ToMessageDecoder.java:265)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.DefaultChannelPipeline$HeadContext.channelR
>>> ead(DefaultChannelPipeline.java:1414)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>> ultChannelPipeline.java:945)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>>> tLoop.java:404)
>>> >> >         at
>>> >> > io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
>>> >> >         at
>>> >> >
>>> >> > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin
>>> gleThreadEventExecutor.java:886)
>>> >> >         at java.base/java.lang.Thread.run(Thread.java:844)
>>> >> > Caused by: java.lang.IllegalStateException: Received unknown
>>> response :
>>> >> > op
>>> >> > code = 6
>>> >> >         at
>>> >> >
>>> >> > org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe
>>> CoderPreV3.decode(BookieProtoEncoding.java:294)
>>> >> >         at
>>> >> >
>>> >> > org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco
>>> der.decode(BookieProtoEncoding.java:478)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>> essageToMessageDecoder.java:88)
>>> >> >
>>> >> >
>>> >> >         ... 19 more
>>> >> >
>>> >> >
>>> >> > Errors on Bookie side, with log level = INFO)
>>> >> >
>>> >> >
>>> >> > io.netty.channel.DefaultChannelPipeline:
>>> >> > 8-03-07-09-51-38       An exceptionCaught() event was fired, and
it
>>> >> > reached
>>> >> > at the tail of the pipeline. It usually means the last handler
in
>>> the
>>> >> > pipeline did not handle the exception.
>>> >> > 18-03-07-09-51-38       io.netty.util.IllegalReferenc
>>> eCountException:
>>> >> > refCnt: 0, increment: 1
>>> >> > io.netty.util.IllegalReferenceCountException: refCnt: 0,
>>> increment: 1
>>> >> >         at
>>> >> >
>>> >> > io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs
>>> tractReferenceCountedByteBuf.java:100)
>>> >> >         at
>>> >> >
>>> >> > io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst
>>> ractReferenceCountedByteBuf.java:84)
>>> >> >         at
>>> >> > io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.
>>> java:88)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe
>>> ssage(DefaultChannelPipeline.java:1202)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.DefaultChannelPipeline$TailContext.channelR
>>> ead(DefaultChannelPipeline.java:1299)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>>> annelInboundHandlerAdapter.java:86)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>> >> >         at
>>> >> >
>>> >> > org.apache.bookkeeper.proto.BookieRequestHandler.channelRead
>>> (BookieRequestHandler.java:78)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>>> annelInboundHandlerAdapter.java:86)
>>> >> >         at
>>> >> >
>>> >> > org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch
>>> annelRead(AuthHandler.java:88)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>> essageToMessageDecoder.java:102)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>> ByteToMessageDecoder.java:310)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>> ByteToMessageDecoder.java:297)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>>> oMessageDecoder.java:413)
>>> >> >         at
>>> >> >
>>> >> > io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>> ToMessageDecoder.java:265)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.DefaultChannelPipeline$HeadContext.channelR
>>> ead(DefaultChannelPipeline.java:1414)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>> ultChannelPipeline.java:945)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>> >> >         at
>>> >> >
>>> >> > io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>>> tLoop.java:404)
>>> >> >         at
>>> >> > io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
>>> >> >         at
>>> >> >
>>> >> > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin
>>> gleThreadEventExecutor.java:886)
>>> >> >         at java.base/java.lang.Thread.run(Thread.java:844)
>>>
>>
> questions:
>
> 1) which protocol version are you using?
>
> 2) is the exception coming from read response or write response? it sound
> like write response. If it is write response, it would be surprised about
> the ref count. because we don't send any buffer back in write response.
>
>
>> >> >
>>> >> >
>>> >
>>> >
>>>
>>
>>
>

Mime
View raw message