accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3811) Improve exception during held commits sent back to clients from BatchWriter
Date Wed, 13 May 2015 17:05:59 GMT


Josh Elser commented on ACCUMULO-3811:

I'm seeing this pretty regularly with datanode (DN) agitation on, and held commits are the cause each time.
I'm surprised to see so little resilience from HDFS:

Big CI picture w/ agitation
20150513 08:58:26 Killing datanode on cn021
20150513 09:08:26 Starting datanode on cn021

2015-05-13 09:08:16,473 [tserver.TabletServer$ThriftClientHandler] ERROR: Commits are held
org.apache.accumulo.tserver.HoldTimeoutException: Commits are held

2015-05-13 09:08:16,479 [impl.TabletServerBatchWriter] ERROR: Server side error on cn022:9997:
org.apache.thrift.TApplicationException: Internal error processing closeUpdate
2015-05-13 09:08:16,483 [start.Main] ERROR: Thread 'org.apache.accumulo.test.continuous.ContinuousIngest'

Maybe some blocks aren't fully replicated? I'm not sure, but I feel like things shouldn't bog down like this.

> Improve exception during held commits sent back to clients from BatchWriter
> ---------------------------------------------------------------------------
>                 Key: ACCUMULO-3811
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client, tserver
>            Reporter: Josh Elser
>             Fix For: 1.8.0
> Running CI on 1.7.0_rc3, I'm noticing that with datanode agitation, I'm frequently seeing
> the BatchWriter die.
> It seems that when the ingester tries to flush right after a datanode dies, the system
> is polling to minor compact, which blocks the flush and ultimately results in a
> HoldTimeoutException being thrown.
> It might be due to under-replication that there are no other datanodes available to serve
> the necessary block, but it's a good example of how clients have no way to recover from this
> case. Clients should be able to know if the system is blocking writes and be able to wait and
> then retry their update. Right now they just see an opaque TApplicationException with no
> indication of the nature of the failure.
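
The "wait and then retry" behavior asked for in the description could look roughly like the sketch below. It is not Accumulo code: WritesHeldException is a hypothetical stand-in for a typed "writes are currently held" signal, which clients do not get today, and HeldCommitRetry, retry, maxAttempts, and initialBackoffMs are names invented for illustration. It only shows the client-side pattern, assuming the server could report a held-commit condition distinctly from other failures.

```java
import java.util.concurrent.Callable;

public class HeldCommitRetry {

    /** Hypothetical exception standing in for a typed "writes are held" signal. */
    public static class WritesHeldException extends Exception {
        public WritesHeldException(String message) {
            super(message);
        }
    }

    /**
     * Runs the action, retrying with exponential backoff for as long as it
     * reports that writes are held. Any other exception propagates at once.
     * The last WritesHeldException is rethrown once maxAttempts is reached.
     */
    public static <T> T retry(Callable<T> action, int maxAttempts, long initialBackoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (WritesHeldException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: the hold outlasted our patience
                }
                Thread.sleep(backoffMs); // wait for the hold to clear
                backoffMs *= 2;          // back off further on each retry
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated flush that is held twice before succeeding.
        int[] calls = {0};
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new WritesHeldException("Commits are held");
            }
            return "flushed";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point of a dedicated exception type in this sketch is that the client can tell a temporary hold apart from a permanent failure and retry only the former, which an opaque internal error does not allow.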
