accumulo-dev mailing list archives

From Corey Nolet <cjno...@gmail.com>
Subject Re: Tablet server thrift issue
Date Sat, 23 Aug 2014 03:41:20 GMT
Josh,

Your advice is definitely useful. I also thought about catching the
exception and retrying with a fresh batch writer, but the fact that a
failed batch writer stays failed until it is re-instantiated is really
only a nuisance. The TabletServerBatchWriter could be designed much
better, I agree, but that is not the root of the problem.
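
For reference, the catch-and-retry approach could look roughly like the
sketch below. It is not Accumulo code: MutationSink, ReplayingWriter and
DemoSink are hypothetical names, with MutationSink standing in for the two
BatchWriter calls involved (adding a mutation and flushing). It assumes, per
Josh's note, that only mutations added before the last successful flush are
known to be durable, so everything added since then is kept and replayed
through a fresh writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class ReplaySketch {
    // Hypothetical stand-in for the two batch writer calls this sketch needs.
    interface MutationSink<M> {
        void add(M mutation) throws Exception;
        void flush() throws Exception;
    }

    // Keeps a copy of everything added since the last successful flush and
    // replays it through a fresh sink when the current one fails.
    static final class ReplayingWriter<M> {
        private final Supplier<MutationSink<M>> factory;
        private final List<M> unflushed = new ArrayList<>();
        private MutationSink<M> sink;

        ReplayingWriter(Supplier<MutationSink<M>> factory) {
            this.factory = factory;
            this.sink = factory.get();
        }

        synchronized void add(M mutation) throws Exception {
            unflushed.add(mutation);          // keep our own copy first
            try {
                sink.add(mutation);
            } catch (Exception e) {
                recover();
            }
        }

        synchronized void flush() throws Exception {
            try {
                sink.flush();
                unflushed.clear();            // now known to be durable
            } catch (Exception e) {
                recover();
            }
        }

        // Discard the failed sink, replay the saved mutations, flush once.
        // A second failure propagates to the caller.
        private void recover() throws Exception {
            sink = factory.get();
            for (M m : unflushed) {
                sink.add(m);
            }
            sink.flush();
            unflushed.clear();
        }
    }

    // Demo sink: the first instance fails on flush, later ones succeed.
    static final class DemoSink implements MutationSink<String> {
        static final List<String> durable = new ArrayList<>();
        static int created = 0;
        private final boolean failing;
        private final List<String> pending = new ArrayList<>();

        DemoSink() {
            failing = (created++ == 0);
        }

        public void add(String m) {
            pending.add(m);
        }

        public void flush() throws Exception {
            if (failing) {
                throw new Exception("simulated server error");
            }
            durable.addAll(pending);
            pending.clear();
        }
    }

    public static void main(String[] args) throws Exception {
        ReplayingWriter<String> w = new ReplayingWriter<>(DemoSink::new);
        w.add("row1");
        w.add("row2");
        w.flush();                            // first sink fails, second is used
        System.out.println(DemoSink.durable); // prints [row1, row2]
    }
}
```

Replaying can apply a mutation twice if the failed writer had already
delivered it, so this is only safe when re-applying a mutation is harmless.
It also means holding a second copy of every unflushed mutation.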

The Thrift exception that is causing the issue is what I'd like to get to
the bottom of. It's throwing the following:

    TApplicationException: applyUpdates failed: out of sequence response

I've never seen this exception before in regular use of the client API, but
I also just updated to 1.6.0. Google isn't showing anything useful about how
this exception could come about other than a bad threading model, and I
don't see any drastic changes or other user complaints on the mailing list
that would support that line of thought. Quite frankly, I'm stumped. This
could be caused by a Thrift bug or by something bad on my system, and have
nothing to do with Accumulo.
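
To make the "bad threading model" theory concrete: as far as I understand
it, a generated Thrift client numbers each request and rejects a reply whose
number does not match the last request it sent. The toy model below is not
Thrift code (ToyClient and its methods are hypothetical), and it assumes the
server simply echoes the request number back. It only shows how two callers
sharing one unsynchronized client could produce this message.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class SeqIdSketch {
    // Toy request/response client that numbers its calls.
    static final class ToyClient {
        private int seqid = 0;                           // id of the last request sent
        private final Deque<Integer> wire = new ArrayDeque<>(); // replies, in send order

        // Send a request; the assumed server echoes the id back on the wire.
        int send() {
            wire.add(++seqid);
            return seqid;
        }

        // Read the next reply and check it against the last id sent.
        int receive(String method) {
            int got = wire.remove();
            if (got != seqid) {
                throw new IllegalStateException(method + " failed: out of sequence response");
            }
            return got;
        }
    }

    public static void main(String[] args) {
        // One caller at a time: send, then receive. The ids match.
        ToyClient ok = new ToyClient();
        ok.send();
        ok.receive("applyUpdates");

        // Two callers interleaved on one client: A sends, B sends, A receives.
        // A reads reply 1 while the client expects 2, so the check fails.
        ToyClient shared = new ToyClient();
        shared.send();                                   // caller A
        shared.send();                                   // caller B
        try {
            shared.receive("applyUpdates");              // caller A
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether anything like this interleaving actually happens inside the 1.6.0
client is exactly what I can't tell.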

Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had
seen the exception before and may remember what it was/how they fixed it.


On Fri, Aug 22, 2014 at 10:58 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Don't mean to tell you that I don't think there might be a bug/otherwise,
> that's pretty much just the limit of what I know about the server-side
> sessions :)
>
> If you have concrete "this worked in 1.4.4" and "this happens instead with
> 1.6.0", that'd make a great ticket :D
>
> The BatchWriter failure case is pretty rough, actually. Eric has made some
> changes to help already (in 1.6.1, I think), but it needs an overhaul that
> I haven't been able to make time to fix properly, either. IIRC, the only
> guarantee you have is that all mutations added before the last flush()
> happened are durable on the server. Anything else is a guess. I don't know
> the specifics, but that should be enough to work with (and saving off
> mutations shouldn't be too costly since they're stored serialized).
>
>
> On 8/22/14, 5:44 PM, Corey Nolet wrote:
>
>> Thanks Josh,
>>
>> I understand about the session ID completely but the problem I have is
>> that the exact same client code worked, line for line, just fine in 1.4.4
>> and it's acting up in 1.6.0. I also seem to remember the BatchWriter
>> automatically creating a new session when one expired without an exception
>> causing it to fail on the client.
>>
>> I know we've made changes since 1.4.4 but I'd like to troubleshoot the
>> actual issue of the BatchWriter failing due to the thrift exception rather
>> than just catching the exception and trying mutations again. The other
>> issue is that I've already submitted a bunch of mutations to the batch
>> writer from different threads. Does that mean I need to be storing them
>> off twice? (once in the BatchWriter's cache and once in my own)
>>
>> The BatchWriter in my ingester is constantly sending data and the tablet
>> servers have been given more than enough memory to be able to keep up.
>> There's no swap being used and the network isn't experiencing any errors.
>>
>>
>> On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>
>>> If you get an error from a BatchWriter, you pretty much have to throw
>>> away that instance of the BatchWriter and make a new one. See
>>> ACCUMULO-2990. If you want, you should be able to catch/recover from this
>>> without having to restart the ingester.
>>>
>>> If the session ID is invalid, my guess is that it hasn't been used
>>> recently and the tserver cleaned it up. The exception logic isn't the
>>> greatest (as it just is presented to you as a RTE).
>>>
>>> https://issues.apache.org/jira/browse/ACCUMULO-2990
>>>
>>>
>>> On 8/22/14, 4:35 PM, Corey Nolet wrote:
>>>
>>>> Eric & Keith, Chris mentioned to me that you guys have seen this issue
>>>> before. Any ideas from anyone else are much appreciated as well.
>>>>
>>>> I recently updated a project's dependencies to Accumulo 1.6.0 built with
>>>> Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest
>>>> component which is running all the time with a batch writer using many
>>>> threads to push mutations into Accumulo.
>>>>
>>>> The issue I'm having is a show stopper. At different intervals of time,
>>>> sometimes an hour, sometimes 30 minutes, I'm getting
>>>> MutationsRejectedExceptions (server errors) from the
>>>> TabletServerBatchWriter. Once they start, I need to restart the ingester
>>>> to get them to stop. They always come back within 30 minutes to an
>>>> hour... rinse, repeat.
>>>>
>>>> The exception always happens on different tablet servers. It's a thrift
>>>> error saying a message was received out of sequence. In the TabletServer
>>>> logs, I see an "Invalid session id" exception which happens only once
>>>> before the client-side batch writer starts spitting out the MREs.
>>>>
>>>> I'm running some heavyweight processing in Storm alongside the tablet
>>>> servers. I shut that processing off in hopes that maybe it was the
>>>> culprit but that hasn't fixed the issue.
>>>>
>>>> I'm surprised I haven't seen any other posts on the topic.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>
