accumulo-dev mailing list archives

From Josh Elser
Subject Re: Tablet server thrift issue
Date Sat, 23 Aug 2014 02:58:38 GMT
I don't mean to say there isn't a bug or something else going on;
that's pretty much just the limit of what I know about the
server-side sessions :)

If you have concrete "this worked in 1.4.4" and "this happens instead 
with 1.6.0", that'd make a great ticket :D

The BatchWriter failure case is pretty rough, actually. Eric has made
some changes to help already (in 1.6.1, I think), but it needs an
overhaul that I haven't been able to make time for, either.
IIRC, the only guarantee you have is that all mutations added before the
last flush() are durable on the server. Anything else is a
guess. I don't know the specifics, but that should be enough to work
with (and saving off mutations shouldn't be too costly since they're
stored serialized).
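
A rough sketch of that "save off mutations and replay them on a fresh
writer" idea, in plain Java. Everything named here is hypothetical:
MutationSink is a stand-in interface that mirrors the
addMutation/flush/close shape of a BatchWriter, and ReplayingWriter is
not part of Accumulo. It assumes your mutations are safe to apply more
than once, since a replay may resend some that already reached the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for the BatchWriter calls this sketch relies on. */
interface MutationSink<M> {
    void add(M mutation) throws Exception;
    void flush() throws Exception;
    void close() throws Exception;
}

/** Keeps a copy of every mutation added since the last flush that succeeded. */
final class ReplayingWriter<M> {
    private final Supplier<MutationSink<M>> factory; // creates a brand-new writer
    private final int maxAttempts;
    private final List<M> unflushed = new ArrayList<>();
    private MutationSink<M> sink;
    private boolean broken = false;

    ReplayingWriter(Supplier<MutationSink<M>> factory, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.sink = factory.get();
    }

    /** Saves the mutation first, then hands it to the current writer. */
    synchronized void add(M mutation) {
        unflushed.add(mutation);
        if (broken) {
            return; // it will be replayed on the next flush()
        }
        try {
            sink.add(mutation);
        } catch (Exception e) {
            broken = true; // the writer is suspect; replace it on flush()
        }
    }

    /** Flushes; on failure discards the writer, makes a new one, and replays. */
    synchronized void flush() {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (broken) {
                    try {
                        sink.close();
                    } catch (Exception ignored) {
                        // the old writer already failed; nothing more to do with it
                    }
                    sink = factory.get();
                    for (M m : unflushed) {
                        sink.add(m);
                    }
                    broken = false;
                }
                sink.flush();
                unflushed.clear(); // only now can the saved copies be dropped
                return;
            } catch (Exception e) {
                last = e;
                broken = true;
            }
        }
        throw new IllegalStateException(
            "flush failed after " + maxAttempts + " attempts", last);
    }

    /** Number of mutations not yet covered by a flush that succeeded. */
    synchronized int pending() {
        return unflushed.size();
    }
}
```

With a real client, the factory would be whatever creates your
BatchWriter, wrapped in an adapter to MutationSink. How often you call
flush() decides how many mutations you hold on to twice.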

On 8/22/14, 5:44 PM, Corey Nolet wrote:
> Thanks Josh,
> I understand about the session ID completely but the problem I have is that
> the exact same client code worked, line for line, just fine in 1.4.4 and
> it's acting up in 1.6.0. I also seem to remember the BatchWriter
> automatically creating a new session when one expired without an exception
> causing it to fail on the client.
> I know we've made changes since 1.4.4 but I'd like to troubleshoot the
> actual issue of the BatchWriter failing due to the thrift exception rather
> than just catching the exception and trying mutations again. The other
> issue is that I've already submitted a bunch of mutations to the batch
> writer from different threads. Does that mean I need to be storing them off
> twice? (once in the BatchWriter's cache and once in my own)
> The BatchWriter in my ingester is constantly sending data and the tablet
> servers have been given more than enough memory to be able to keep up.
> There's no swap being used and the network isn't experiencing any errors.
> On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser wrote:
>> If you get an error from a BatchWriter, you pretty much have to throw away
>> that instance of the BatchWriter and make a new one. See ACCUMULO-2990. If
>> you want, you should be able to catch/recover from this without having to
>> restart the ingester.
>> If the session ID is invalid, my guess is that it hasn't been used
>> recently and the tserver cleaned it up. The exception logic isn't the
>> greatest (it is just presented to you as a RuntimeException).
>> On 8/22/14, 4:35 PM, Corey Nolet wrote:
>>> Eric & Keith, Chris mentioned to me that you guys have seen this issue
>>> before. Any ideas from anyone else are much appreciated as well.
>>> I recently updated a project's dependencies to Accumulo 1.6.0 built with
>>> Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest
>>> component which is running all the time with a batch writer using many
>>> threads to push mutations into Accumulo.
>>> The issue I'm having is a show stopper. At different intervals of time,
>>> sometimes an hour, sometimes 30 minutes, I'm getting
>>> MutationsRejectedExceptions (server errors) from the
>>> TabletServerBatchWriter. Once they start, I need to restart the ingester to
>>> get them to stop. They always come back within 30 minutes to an hour...
>>> rinse, repeat.
>>> The exception always happens on different tablet servers. It's a thrift
>>> error saying a message was received out of sequence. In the TabletServer
>>> logs, I see an "Invalid session id" exception which happens only once
>>> before the client-side batch writer starts spitting out the MREs.
>>> I'm running some heavyweight processing in Storm alongside the tablet
>>> servers. I shut that processing off in hopes that maybe it was the culprit
>>> but that hasn't fixed the issue.
>>> I'm surprised I haven't seen any other posts on the topic.
>>> Thanks!
