accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-2990) BatchWriter never recovers from failure(s)
Date Mon, 14 Jul 2014 04:46:05 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060319#comment-14060319 ]

Josh Elser commented on ACCUMULO-2990:
--------------------------------------

There are two paths here that I see we can take.

1. Make the BatchWriter resilient to server-side errors in all branches
2. Only make the change in HEAD

I think the only worry about doing this in older versions is getting the error handling correct
(which can hopefully be thoroughly tested via mocking). In the case where some users have
code that expects the BatchWriter to become unusable after a MutationsRejectedException is
thrown, having the BatchWriter remain usable would not break their code. There is still the
possibility of confusion about when this was fixed; however, I think this risk is small,
given that the BatchWriter never asserts that it is unusable (looks like a bug, smells
like a bug, etc).

Please speak up if you disagree with this or if I missed some case.
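
The compatibility argument above can be illustrated with a small sketch. It assumes a client that treats any rejection as fatal for its current writer and builds a new one. All names here (`Writer`, `ToyWriter`, `DefensiveClient`, `RejectedException`) are hypothetical stand-ins, not the Accumulo API, and the failure model is deliberately simplified. Under those assumptions the client sees the same result whether or not the writer would have recovered on its own.

```java
import java.util.function.Supplier;

/** Illustrative only: toy stand-ins, not the Accumulo client classes. */
public class RecreateOnRejectSketch {

    /** Hypothetical stand-in for MutationsRejectedException. */
    static class RejectedException extends Exception {
        RejectedException(String message) { super(message); }
    }

    /** Hypothetical minimal writer interface; the real BatchWriter API is larger. */
    interface Writer {
        void addMutation(String mutation) throws RejectedException;
    }

    /** Toy writer that may start in a failed state; it stays failed only if sticky is true. */
    static class ToyWriter implements Writer {
        private final boolean sticky;
        private boolean somethingFailed;

        ToyWriter(boolean sticky, boolean startFailed) {
            this.sticky = sticky;
            this.somethingFailed = startFailed;
        }

        @Override
        public void addMutation(String mutation) throws RejectedException {
            if (somethingFailed) {
                if (!sticky) {
                    somethingFailed = false; // resilient variant: report once, then recover
                }
                throw new RejectedException("earlier write failed");
            }
            // A healthy writer would queue the mutation here.
        }
    }

    /** Client that assumes a writer is unusable after any rejection and replaces it. */
    static class DefensiveClient {
        private final Supplier<Writer> factory;
        private Writer writer;

        DefensiveClient(Writer first, Supplier<Writer> factory) {
            this.writer = first;
            this.factory = factory;
        }

        /** Returns true if the mutation was accepted by the current writer. */
        boolean write(String mutation) {
            try {
                writer.addMutation(mutation);
                return true;
            } catch (RejectedException e) {
                writer = factory.get(); // discard the old writer regardless of its state
                return false;
            }
        }
    }

    /** Runs the defensive client against a writer that starts out failed. */
    public static int successfulWrites(boolean sticky, int attempts) {
        DefensiveClient client = new DefensiveClient(
                new ToyWriter(sticky, true),
                () -> new ToyWriter(sticky, false));
        int ok = 0;
        for (int i = 0; i < attempts; i++) {
            if (client.write("m" + i)) {
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println("sticky writer:    " + successfulWrites(true, 5) + " of 5 accepted");
        System.out.println("resilient writer: " + successfulWrites(false, 5) + " of 5 accepted");
    }
}
```

In this toy model both runs accept 4 of 5 mutations, since the client replaces the writer after the first rejection in either case.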

> BatchWriter never recovers from failure(s)
> ------------------------------------------
>
>                 Key: ACCUMULO-2990
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2990
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 1.5.2, 1.6.1, 1.7.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In trying to understand what's happening in ACCUMULO-2964, I noticed that I had similar
> exceptions from two different threads. One of the threads started working after the unexplained
> thrift exceptions from a tserver restart, and the other continued to repeatedly fail for the
> lifetime of the test.
> I repeatedly saw this exception: 
> {noformat}
> 2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]}  # server errors 0 # exceptions 0
>         at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>         at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:45)
>         at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:184)
>         at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
>         at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
> {noformat}
> The part that struck me as odd was that the BatchWriter wasn't writing to the metadata table,
> but to the replication table.
> I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException,
> the BatchWriter becomes useless because the internal member {{somethingFailed}} is never reset
> to {{false}} after the failure is reported. The same goes for {{serverSideErrors}}, {{unknownErrors}},
> and {{lastUnknownErrors}}.
> If this is the case, this is a bug because the BatchWriter should be resilient in this
> regard and not force the client to create a new Instance. If that's infeasible to do, we should
> add exceptions to the BatchWriter that fail fast when a BatchWriter is used in a state where
> it will only report the same failure over and over again.
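
The behavior described in the issue can be reproduced in miniature. The following is a minimal sketch, not the TabletServerBatchWriter source: `ToyWriter`, `RejectedException`, `recordBackgroundFailure` and the `resetAfterReport` switch are hypothetical names introduced for illustration, and only the `somethingFailed` flag and the `checkForFailures` name are taken from the report. It contrasts a flag that is never cleared with one that is cleared once the failure has been reported.

```java
/** Illustrative only: a simplified model of the failure flag, not the Accumulo source. */
public class StickyFailureSketch {

    /** Hypothetical stand-in for MutationsRejectedException. */
    static class RejectedException extends Exception {
        RejectedException(String message) { super(message); }
    }

    /** Toy writer whose failure flag is either sticky (as reported) or cleared on report. */
    static class ToyWriter {
        private final boolean resetAfterReport;
        private boolean somethingFailed = false;
        private int accepted = 0;

        ToyWriter(boolean resetAfterReport) {
            this.resetAfterReport = resetAfterReport;
        }

        /** Simulates a background send that failed, e.g. during a tserver restart. */
        void recordBackgroundFailure() {
            somethingFailed = true;
        }

        void addMutation(String mutation) throws RejectedException {
            checkForFailures();
            accepted++; // a real writer would queue the mutation here
        }

        private void checkForFailures() throws RejectedException {
            if (somethingFailed) {
                if (resetAfterReport) {
                    somethingFailed = false; // failure is reported once, then the writer is usable again
                }
                throw new RejectedException("earlier write failed");
            }
        }
    }

    /** Counts how many of the given addMutation attempts are rejected after one failure. */
    public static int rejectionsAfterOneFailure(boolean resetAfterReport, int attempts) {
        ToyWriter writer = new ToyWriter(resetAfterReport);
        writer.recordBackgroundFailure();
        int rejected = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                writer.addMutation("m" + i);
            } catch (RejectedException e) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("sticky flag:     " + rejectionsAfterOneFailure(false, 5) + " of 5 rejected");
        System.out.println("reset on report: " + rejectionsAfterOneFailure(true, 5) + " of 5 rejected");
    }
}
```

With the sticky flag every one of the 5 attempts is rejected, which matches the "continued to repeatedly fail" symptom; with the reset only the first attempt is rejected. Whether the real fix should clear state in this way or take a different route is a design question this sketch does not settle.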



--
This message was sent by Atlassian JIRA
(v6.2#6252)
