accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@mit.edu>
Subject Local Combiners to pre-sum at BatchWriter
Date Sat, 04 Apr 2015 19:58:20 GMT
I've been thinking about a scenario that seems common among high-ingest
Accumulo users. Suppose we have a "combiner"-type iterator configured on a
table at all scopes.  One technique to increase ingest performance is "pre-summing":
run the combiner on local entries before they are sent through a
BatchWriter, in order to reduce the number of entries sent to the tablet
server.

One way to do pre-summing is to create a Map<Key,Value> of entries to send
to the server on the local client. This equates to the following client
code, run for each entry to send to Accumulo:

  Key k = nextKeyToSend();
  Value v = nextValueToSend();
  Value vPrev = map.get(k);
  if (vPrev != null)
    v = combiner.combine(vPrev, v);  // pseudocode: the real Combiner API is reduce(Key, Iterator<Value>)
  map.put(k, v);

Each time the map's size exceeds a threshold (so we don't run out of memory
on the client), we dump it through the BatchWriter:

  BatchWriter bw; // set up previously from connector
  for (Map.Entry<Key,Value> entry : map.entrySet()) {
    Key k = entry.getKey();
    Mutation m = new Mutation(k.getRow());
    m.put(k.getColumnFamily(), k.getColumnQualifier(), entry.getValue());
    bw.addMutation(m);
  }
  map.clear(); // reset the local cache after dumping

(Side note: this uses one column update per mutation.  I've never
investigated whether it would be more efficient to put all the updates to a
single row [i.e. chaining multiple columns in the same row] into one
mutation instead.)
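On that side note: since each Mutation carries its row once, chaining all of a row's column updates into a single Mutation amounts to grouping entries by row before building Mutations. A minimal, self-contained sketch of that grouping step, with String triples standing in for row/column/value (the names here are hypothetical, not Accumulo API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowGrouping {
    /** Group (row, column, value) triples by row, preserving first-seen row order. */
    public static Map<String, List<String[]>> groupByRow(List<String[]> entries) {
        Map<String, List<String[]>> byRow = new LinkedHashMap<>();
        for (String[] e : entries) {
            // e[0] = row, e[1] = column, e[2] = value
            byRow.computeIfAbsent(e[0], r -> new ArrayList<>()).add(e);
        }
        // A real version would then build one Mutation per map entry,
        // calling m.put(family, qualifier, value) once per triple.
        return byRow;
    }
}
```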

This solution works, but it duplicates the purpose of the BatchWriter and
adds complexity to the client.  If we have to create a separate "cache"
collection, track its size, and dump it to a BatchWriter once it grows too
big, then we're reimplementing behavior the BatchWriter already provides:
an internal cache whose size is set by BatchWriterConfig.setMaxMemory()
(which starts flushing once half the maximum memory is used).  We end up
with two caches (the user-created map plus the BatchWriter's buffer) where
one should be sufficient.

I'm wondering whether there is a way to pre-sum mutations added to a
BatchWriter automatically, so that we can add entries to a BatchWriter and
trust that it will apply a combiner function to them before transmitting to
the tablet server. Something to the effect of:

  BatchWriter bw; // set up previously from connector
  Combiner combiner = new SummingCombiner();
  Map<String, String> combinerOptions = new HashMap<>();
  combinerOptions.put("all", "true"); // or some other column subset option
  bw.addCombiner(combiner, combinerOptions);
  // or perhaps more generally/ambitiously: bw.addWriteIterator(combiner, combinerOptions);

  // effect: the combiner is applied right before flushing data to the server;
  // if the combiner throws an exception, throw a MutationsRejectedException
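Absent such an API, the proposed semantics could be prototyped as a decorator that buffers, combines, and then delegates. A sketch under simplifying assumptions (String keys and long values stand in for Accumulo's Key/Value; the SimpleWriter interface is a hypothetical stand-in for BatchWriter):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

/** Hypothetical stand-in for the BatchWriter interface. */
interface SimpleWriter {
    void write(String key, long value);
    void flush();
}

/** Decorator that applies a combiner to buffered entries before delegating. */
class CombiningWriter implements SimpleWriter {
    private final SimpleWriter delegate;
    private final BinaryOperator<Long> combiner;
    private final TreeMap<String, Long> buffer = new TreeMap<>(); // sorted, like the real buffer
    private final int maxEntries;

    CombiningWriter(SimpleWriter delegate, BinaryOperator<Long> combiner, int maxEntries) {
        this.delegate = delegate;
        this.combiner = combiner;
        this.maxEntries = maxEntries;
    }

    @Override
    public void write(String key, long value) {
        buffer.merge(key, value, combiner); // pre-sum on insert
        if (buffer.size() >= maxEntries)
            flush();
    }

    @Override
    public void flush() {
        for (Map.Entry<String, Long> e : buffer.entrySet())
            delegate.write(e.getKey(), e.getValue());
        buffer.clear();
        delegate.flush();
    }
}
```

The decorator keeps the single-buffer property: entries are combined in the same structure that feeds the underlying writer, rather than in a second user-managed map.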

Is there a better way to accomplish this, without duplicating BatchWriter's
buffer?  Or would this make a nice addition to the API?  If I understand
the BatchWriter correctly, it already sorts entries before sending them to
the tablet server, because the tablet server can process them more
efficiently that way.  If so, the overhead of inserting a combining step
between the sorting phase and the network-transmit phase seems small,
especially since it reduces network traffic anyway.
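To illustrate why the overhead would be small: once the buffer is sorted, combining reduces to a single pass merging adjacent entries with equal keys. A minimal sketch, again using String keys and long values in place of Accumulo's Key/Value:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PreSumPass {
    /** One pass over a key-sorted list, summing runs of entries with equal keys. */
    public static List<Map.Entry<String, Long>> combineSorted(List<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : sorted) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).getKey().equals(e.getKey())) {
                // same key as the previous output entry: fold it in
                out.set(last, new SimpleEntry<>(e.getKey(),
                        out.get(last).getValue() + e.getValue()));
            } else {
                out.add(new SimpleEntry<>(e.getKey(), e.getValue()));
            }
        }
        return out;
    }
}
```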

Regards,
Dylan Hutchison
