Mailing-List: contact notifications-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jira@apache.org
Date: Tue, 31 May 2016 22:10:12 +0000 (UTC)
From: "Josh Elser (JIRA)" <jira@apache.org>
To: notifications@accumulo.apache.org
Message-ID: <JIRA.12726868.1405107731000.343096.1464732612920@Atlassian.JIRA>
In-Reply-To: <JIRA.12726868.1405107731000@Atlassian.JIRA>
References: <JIRA.12726868.1405107731000@Atlassian.JIRA> <JIRA.12726868.1405107731592@arcas>
Subject: [jira] [Commented] (ACCUMULO-2990) BatchWriter never recovers from
 failure(s)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Tue, 31 May 2016 22:10:14 -0000


    [ https://issues.apache.org/jira/browse/ACCUMULO-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308752#comment-15308752 ] 

Josh Elser commented on ACCUMULO-2990:
--------------------------------------

bq. Do you think you are going to have a chance to come back to this?

As much as it makes me sad to say so, likely not in a 1.7.2 timeframe. Don't wait around for me. I can always pull it back if necessary.

> BatchWriter never recovers from failure(s)
> ------------------------------------------
>
>                 Key: ACCUMULO-2990
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2990
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.3, 1.8.1
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In trying to understand what's happening in ACCUMULO-2964, I noticed that I had similar exceptions from two different threads. One of the threads started working after the unexplained thrift exceptions from a tserver restart, and the other continued to fail repeatedly for the lifetime of the test.
> I repeatedly saw this exception: 
> {noformat}
> 2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]}  # server errors 0 # exceptions 0
>         at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>         at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:45)
>         at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:184)
>         at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
>         at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
> {noformat}
> The part that struck me as odd was that the BatchWriter wasn't against the metadata table, but the replication table.
> I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException, that BatchWriter becomes useless as the internal member {{somethingFailed}} is never reset back to {{false}} after the failure is reported. Same goes for {{serverSideErrors}}, {{unknownErrors}}, {{lastUnknownErrors}}, too.
> If this is the case, this is a bug because the BatchWriter should be resilient in this regard and not force the client to create a new Instance. If that's infeasible to do, we should add exceptions to the BatchWriter that fail fast when a BatchWriter is used that will repeatedly report the same failure.
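
The behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Accumulo's TabletServerBatchWriter: the classes StickyWriter, RecoveringWriter and BatchWriterSketch, the simulateFailure() hook, and the use of IllegalStateException in place of MutationsRejectedException are all hypothetical. Only the flag name somethingFailed comes from the description. It contrasts a writer whose failure flag is never cleared with one that clears the flag once the failure has been reported.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a writer whose failure flag is never reset (hypothetical, not Accumulo code). */
class StickyWriter {
    // Plays the role of the somethingFailed member named in the description.
    protected boolean somethingFailed = false;
    protected final List<String> accepted = new ArrayList<>();

    /** Hypothetical hook standing in for a failed background send. */
    void simulateFailure() {
        somethingFailed = true;
    }

    void addMutation(String mutation) {
        checkForFailures();
        accepted.add(mutation);
    }

    /** Throws on every call once a failure has been recorded, because the flag is never cleared. */
    protected void checkForFailures() {
        if (somethingFailed) {
            throw new IllegalStateException("mutations rejected");
        }
    }
}

/** Variant that clears the flag after reporting the failure once (an assumed fix, for illustration). */
class RecoveringWriter extends StickyWriter {
    @Override
    protected void checkForFailures() {
        if (somethingFailed) {
            somethingFailed = false; // reset so later writes can proceed
            throw new IllegalStateException("mutations rejected");
        }
    }
}

class BatchWriterSketch {
    /** Records one failure, then counts how many of the following writes are rejected. */
    static int failuresSeen(StickyWriter writer, int attempts) {
        writer.simulateFailure();
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                writer.addMutation("m" + i);
            } catch (IllegalStateException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("sticky failures:     " + failuresSeen(new StickyWriter(), 3));
        System.out.println("recovering failures: " + failuresSeen(new RecoveringWriter(), 3));
    }
}
```

In this model the sticky writer rejects every write after a single failure, while the recovering variant reports the failure once and then accepts writes again. Whether clearing state this way is safe in the real client, or whether failing fast is the better choice, is exactly the open question in the description.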

