Date: Wed, 16 Jul 2014 16:02:05 +0000 (UTC)
From: "Keith Turner (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-2990) BatchWriter never recovers from failure(s)

    [ https://issues.apache.org/jira/browse/ACCUMULO-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063647#comment-14063647 ]

Keith Turner commented on ACCUMULO-2990:
----------------------------------------

Need to correctly handle a situation like the following.
# Thread 1 adds 5 mutations to batch writer 1
# Thread 2 adds 6 mutations to batch writer 1
# Thread 2 calls flush() and receives an error, none of the 11 mutations queued were successfully written
# Thread 1 calls flush() and receives an error

> BatchWriter never recovers from failure(s)
> ------------------------------------------
>
>                 Key: ACCUMULO-2990
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2990
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 1.5.2, 1.6.1, 1.7.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In trying to understand what's happening in ACCUMULO-2964, I noticed that I had similar exceptions from two different threads. One of the threads started working after the unexplained thrift exceptions from a tserver restart, and the other continued to fail repeatedly for the lifetime of the test.
> I repeatedly saw this exception:
> {noformat}
> 2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]} # server errors 0 # exceptions 0
> 	at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
> 	at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
> 	at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:45)
> 	at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:184)
> 	at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
> 	at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
> {noformat}
> The part that struck me as odd was that the BatchWriter wasn't against the metadata table, but the replication table.
> I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException, that BatchWriter becomes useless, because the internal member {{somethingFailed}} is never reset back to {{false}} after the failure is reported. The same goes for {{serverSideErrors}}, {{unknownErrors}}, and {{lastUnknownErrors}}.
> If this is the case, this is a bug, because the BatchWriter should be resilient in this regard and not force the client to create a new Instance. If that's infeasible, we should add exceptions to the BatchWriter that fail fast when a BatchWriter is used that will repeatedly report the same failure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
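
The behaviour described in the issue (a failure flag such as {{somethingFailed}} that is never reset once the failure has been reported) can be illustrated with a small model. This is only a sketch under stated assumptions: ModelWriter, RejectedException, setServerHealthy and recovers are hypothetical names invented for illustration, and none of this is the TabletServerBatchWriter code or the Accumulo API. It compares a writer whose flag stays set with one that clears the flag after reporting.

```java
import java.util.ArrayList;
import java.util.List;

class BatchWriterSketch {

    /** Hypothetical stand-in for MutationsRejectedException. */
    static class RejectedException extends Exception {
        RejectedException(String msg) { super(msg); }
    }

    /** Hypothetical model of a writer with a failure flag; not Accumulo code. */
    static class ModelWriter {
        private final boolean resetOnReport;
        private final List<String> queued = new ArrayList<>();
        private boolean somethingFailed = false;
        private boolean serverHealthy = true;
        private int written = 0;

        ModelWriter(boolean resetOnReport) { this.resetOnReport = resetOnReport; }

        synchronized void setServerHealthy(boolean healthy) { serverHealthy = healthy; }

        synchronized void addMutation(String m) throws RejectedException {
            checkForFailures(); // fails immediately if an earlier failure is still recorded
            queued.add(m);
        }

        synchronized void flush() throws RejectedException {
            if (!queued.isEmpty()) {
                if (serverHealthy) {
                    written += queued.size();
                } else {
                    somethingFailed = true; // record that the queued batch was lost
                }
                queued.clear();
            }
            checkForFailures();
        }

        synchronized int written() { return written; }

        private void checkForFailures() throws RejectedException {
            if (somethingFailed) {
                if (resetOnReport) {
                    somethingFailed = false; // clear the flag once the failure is reported
                }
                throw new RejectedException("mutations rejected");
            }
        }
    }

    /** Returns true if the writer accepts and writes a mutation after one failed flush. */
    static boolean recovers(boolean resetOnReport) {
        ModelWriter w = new ModelWriter(resetOnReport);
        w.setServerHealthy(false);
        try {
            w.addMutation("m1");
            w.flush();
        } catch (RejectedException e) {
            // expected: the first flush fails while the server is unhealthy
        }
        w.setServerHealthy(true);
        try {
            w.addMutation("m2");
            w.flush();
            return w.written() == 1;
        } catch (RejectedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("sticky flag recovers: " + recovers(false));     // false
        System.out.println("reset on report recovers: " + recovers(true)); // true
    }
}
```

In this model, clearing the flag on the first report is not enough for the two-thread situation in the comment: if Thread 2's flush() reports and clears the failure, a later flush() from Thread 1 returns normally even though its mutations were lost. An actual fix would presumably need to account for which callers have already been told about a failure.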