accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <>
Subject [jira] [Updated] (ACCUMULO-2990) BatchWriter never recovers from failure(s)
Date Tue, 09 Jun 2015 16:26:01 GMT


Christopher Tubbs updated ACCUMULO-2990:
    Fix Version/s:     (was: 1.6.3)

> BatchWriter never recovers from failure(s)
> ------------------------------------------
>                 Key: ACCUMULO-2990
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.1, 1.8.0
>          Time Spent: 10m
>  Remaining Estimate: 0h
> In trying to understand what's happening in ACCUMULO-2964, I noticed that I had similar
exceptions from two different threads. One of the threads started working again after the unexplained
thrift exceptions from a tserver restart, and the other continued to fail repeatedly for the
lifetime of the test.
> I repeatedly saw this exception: 
> {noformat}
> 2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]}  # server errors 0 # exceptions 0
>         at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(
>         at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(
>         at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(
>         at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(
>         at
>         at
> {noformat}
> The part that struck me as odd was that the BatchWriter wasn't against the metadata table,
but the replication table.
> I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException,
that BatchWriter becomes useless as the internal member {{somethingFailed}} is never reset
back to {{false}} after the failure is reported. Same goes for {{serverSideErrors}}, {{unknownErrors}},
{{lastUnknownErrors}}, too.
> If this is the case, this is a bug because the BatchWriter should be resilient in this
regard and not force the client to create a new instance. If that's infeasible to do, we should
add exceptions to the BatchWriter that fail fast when a BatchWriter is used that would otherwise
report the same failure over and over again.
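
To make the reported behavior concrete, here is a minimal, self-contained Java sketch. It is not the Accumulo implementation: only the names {{somethingFailed}}, {{checkForFailures}} and {{addMutation}} come from the description above, and every class here ({{StickyWriter}}, {{RecoveringWriter}}, {{FailFastWriter}}, {{RejectedException}}) is a hypothetical stand-in. It assumes the failure is tracked by a single boolean and shows three outcomes: the flag is never reset (the behavior described), the flag is reset once the failure has been reported, and the writer fails fast on any use after the first report.

```java
import java.util.ArrayList;
import java.util.List;

public class StickyFailureSketch {

    /** Hypothetical stand-in for MutationsRejectedException. */
    static class RejectedException extends Exception {
        RejectedException(String message) {
            super(message);
        }
    }

    /** Mimics the reported behavior: the failure flag is never cleared. */
    static class StickyWriter {
        protected boolean somethingFailed = false;
        protected final List<String> buffered = new ArrayList<>();

        /** Simulates a background write failure being recorded. */
        void recordFailure() {
            somethingFailed = true;
        }

        void addMutation(String mutation) throws RejectedException {
            checkForFailures();
            buffered.add(mutation);
        }

        protected void checkForFailures() throws RejectedException {
            if (somethingFailed) {
                // The flag stays true, so every later call fails the same way.
                throw new RejectedException("earlier write failed");
            }
        }
    }

    /** Assumed fix 1: clear the flag once the failure has been reported. */
    static class RecoveringWriter extends StickyWriter {
        @Override
        protected void checkForFailures() throws RejectedException {
            if (somethingFailed) {
                somethingFailed = false;
                throw new RejectedException("earlier write failed");
            }
        }
    }

    /** Assumed fix 2: after the first report, reject any further use loudly. */
    static class FailFastWriter extends StickyWriter {
        private boolean reported = false;

        @Override
        protected void checkForFailures() throws RejectedException {
            if (reported) {
                throw new IllegalStateException("writer is unusable after a failure; create a new one");
            }
            if (somethingFailed) {
                reported = true;
                throw new RejectedException("earlier write failed");
            }
        }
    }

    /** Tries to add the given number of mutations and counts the rejections. */
    static int countRejections(StickyWriter writer, int attempts) {
        int rejected = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                writer.addMutation("mutation-" + i);
            } catch (RejectedException e) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        StickyWriter sticky = new StickyWriter();
        sticky.recordFailure();
        System.out.println("sticky rejections: " + countRejections(sticky, 3));

        RecoveringWriter recovering = new RecoveringWriter();
        recovering.recordFailure();
        System.out.println("recovering rejections: " + countRejections(recovering, 3));

        FailFastWriter failFast = new FailFastWriter();
        failFast.recordFailure();
        countRejections(failFast, 1); // first use reports the original failure
        try {
            failFast.addMutation("after-failure");
        } catch (IllegalStateException | RejectedException e) {
            System.out.println("fail-fast: " + e.getMessage());
        }
    }
}
```

With a single recorded failure, the sticky writer rejects all three attempts, while the recovering variant rejects only the first and buffers the other two. Which fix suits the real client depends on whether its internal state can be safely reused after a failure, which this sketch does not model.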

