accumulo-dev mailing list archives

From Michael Moss <michael.m...@gmail.com>
Subject BatchWriter Improvements - An end user's perspective
Date Fri, 26 Aug 2016 16:21:46 GMT
Hello, Folks.

As I look at the following tickets, I thought it might be useful to share
how we are using the BatchWriter, some of the challenges we've had, some
thoughts about its redesign and how we might get involved.

https://issues.apache.org/jira/browse/ACCUMULO-4154
https://issues.apache.org/jira/browse/ACCUMULO-2589
https://issues.apache.org/jira/browse/ACCUMULO-2990

One of our primary use cases for the BatchWriter is from within a Storm
topology, reading from Kafka. Generally speaking, Storm might be persisting
a single mutation or a small set of mutations at a time (low latency), or
larger batches with Trident (higher throughput). In addition to
ACCUMULO-2990 (where any TimedOutException leads to a
MutationsRejectedException and requires a new connection to be made), one
of our requirements is to ensure that a flush covers only the calling
thread's mutations and no others (pseudo transactions). Otherwise, we might
get a failure for a mutation which belongs to another thread and was
already ACKed by Storm, which means we no longer have a 'handle' on that
offset in Kafka to replay the failure - i.e. the message could be 'lost'.

Although the BatchWriter is thread-safe, we end up using a single
BatchWriter per thread to make reasoning about the above simpler, but this
creates a resource issue: the number of connections to Accumulo and
ZooKeeper.
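
To make the per-thread arrangement concrete, here is a minimal Java sketch.
It deliberately does not use the Accumulo API: BufferingWriter, add() and
flush() are hypothetical stand-ins, and mutations are plain strings. It only
illustrates the isolation property we rely on, namely that a flush returns
the calling thread's mutations and nobody else's.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; BufferingWriter is a stand-in, not the Accumulo BatchWriter.
final class PerThreadWriterSketch {

    /** Toy writer that buffers mutations (plain strings here) until flushed. */
    static final class BufferingWriter {
        private final List<String> pending = new ArrayList<>();

        void add(String mutation) {
            pending.add(mutation);
        }

        /** Returns the batch that was flushed and clears the buffer. */
        List<String> flush() {
            List<String> batch = new ArrayList<>(pending);
            pending.clear();
            return batch;
        }
    }

    // One writer per thread, created lazily on first use by that thread.
    private static final ThreadLocal<BufferingWriter> WRITER =
            ThreadLocal.withInitial(BufferingWriter::new);

    /** Buffers a mutation in the calling thread's own writer. */
    static void add(String mutation) {
        WRITER.get().add(mutation);
    }

    /** Flushes only what the calling thread has buffered. */
    static List<String> flush() {
        return WRITER.get().flush();
    }
}
```

With a real writer, each of these per-thread instances would hold its own
resources, which is exactly the cost described above.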

This all makes me wonder what the design goals might have been for the
current version of the driver and if the efforts to rewrite it might
benefit from incorporating elements to address some of these use cases
above.

What can we learn from how drivers for other "NoSQL" databases are
implemented? Would it make sense to remove all the global variables
("somethingFailed"), thread sleep/notify, frequent calls to
"checkForFailures()" and consider using a 'connection pool' model where
writes are single-threaded, linearized and isolated during the connection
lease? Could we make the client non-blocking and with optional pipelining,
so multiple writes could share a connection and allow interleaving of
operations (with individual acks)?
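
As a rough illustration of the "individual acks" idea, here is a small Java
sketch. Everything in it is an assumption made for illustration:
AckedWriterSketch, its write() method and the sink callback are hypothetical
names, not a proposed Accumulo API. A single-threaded executor stands in for
one leased connection, so writes are linearized, and each batch gets its own
CompletableFuture so a failure is reported only to the caller that submitted
that batch.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical sketch of per-batch acknowledgements; not an Accumulo API.
final class AckedWriterSketch implements AutoCloseable {

    // Stands in for a single leased connection: one thread, so batches are
    // applied one at a time in submission order.
    private final ExecutorService connection = Executors.newSingleThreadExecutor();

    // Stands in for the server side; throwing from it models a rejected batch.
    private final Consumer<List<String>> sink;

    AckedWriterSketch(Consumer<List<String>> sink) {
        this.sink = sink;
    }

    /**
     * Submits a batch without blocking the caller. The returned future
     * completes normally if this batch was accepted and exceptionally if
     * this batch failed; other batches are unaffected.
     */
    CompletableFuture<Void> write(List<String> batch) {
        return CompletableFuture.runAsync(() -> sink.accept(batch), connection);
    }

    @Override
    public void close() {
        connection.shutdown();
    }
}
```

In a Storm bolt, the caller could ack or fail the tuple from a callback on
the returned future, so a rejected batch would map back to the Kafka offset
it came from.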

Looking forward to hearing everyone's thoughts.

-Mike
