accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: Local Combiners to pre-sum at BatchWriter
Date Sat, 04 Apr 2015 20:38:30 GMT
Aren't you essentially adding another kind of Accumulo node?
On Apr 4, 2015 3:59 PM, "Dylan Hutchison" <dhutchis@mit.edu> wrote:

> I've been thinking about a scenario that seems common among high-ingest
> Accumulo users. Suppose we have a "combiner"-type iterator set on a table at
> all scopes.  One technique to increase ingest performance is "pre-summing":
> run the combiner on local entries before they are sent through a
> BatchWriter, in order to reduce the number of entries sent to the tablet
> server.
>
> One way to do pre-summing is to build, on the local client, a Map<Key,Value>
> of entries to send to the server. This equates to the following client code,
> run for each entry to send to Accumulo:
>
>   Key k = nextKeyToSend();
>   Value v = nextValueToSend();
>   Value vPrev = map.get(k);
>   if (vPrev != null)
>     v = combiner.combine(vPrev, v); // shorthand for applying the combining function to the two values
>   map.put(k, v);
>
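> (As a concrete combine step, Accumulo's typed encoders can be reused
> client-side. A minimal sketch, assuming the Values hold longs written with
> the fixed-length encoder; combineLongs is a made-up helper name, not an
> Accumulo API:)
>
>   import org.apache.accumulo.core.data.Value;
>   import org.apache.accumulo.core.iterators.LongCombiner;
>
>   // sum two encoded long Values, as a SummingCombiner would on the tserver
>   static Value combineLongs(Value a, Value b) {
>     long sum = LongCombiner.FIXED_LEN_ENCODER.decode(a.get())
>              + LongCombiner.FIXED_LEN_ENCODER.decode(b.get());
>     return new Value(LongCombiner.FIXED_LEN_ENCODER.encode(sum));
>   }
>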
> Each time the map's size exceeds a threshold (we don't want to run out of
> memory on the client), we flush it to a BatchWriter:
>
>   BatchWriter bw; // set up previously from a Connector
>   for (Map.Entry<Key,Value> entry : map.entrySet()) {
>     Key k = entry.getKey();
>     Mutation m = new Mutation(k.getRow());
>     m.put(k.getColumnFamily(), k.getColumnQualifier(), entry.getValue());
>     bw.addMutation(m);
>   }
>   map.clear(); // reset the local cache so combined entries are not re-sent
>
> (side note: I'm using one column update per mutation here.  I've never
> investigated whether it would be more efficient to chain all the updates to
> a single row [i.e., multiple columns in the same row] into one mutation
> instead.)
>
> This solution works, but it duplicates the purpose of the BatchWriter and
> adds complexity to the client.  If we have to create a separate "cache"
> collection, track its size, and dump it to a BatchWriter once it grows too
> big, then we're reimplementing the BatchWriter's own internal cache, whose
> size is set by BatchWriterConfig.setMaxMemory() and which starts flushing
> once half the maximum memory is used.  We end up with two caches (the
> user-created map plus the BatchWriter's) where one should be sufficient.
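>
> (For reference, that internal cache is configured like so; the table name
> is illustrative, and connector is assumed to be set up as above:)
>
>   import java.util.concurrent.TimeUnit;
>   import org.apache.accumulo.core.client.BatchWriter;
>   import org.apache.accumulo.core.client.BatchWriterConfig;
>
>   BatchWriterConfig config = new BatchWriterConfig();
>   config.setMaxMemory(50 * 1024 * 1024); // 50 MB client-side buffer
>   config.setMaxLatency(2, TimeUnit.MINUTES); // flush at least this often
>   config.setMaxWriteThreads(4);
>   BatchWriter bw = connector.createBatchWriter("mytable", config);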
>
> I'm wondering whether there is a way to pre-sum mutations added to a
> BatchWriter automatically, so that we can add entries to a BatchWriter and
> trust that it will apply a combiner function to them before transmitting to
> the tablet server. Something to the effect of:
>
>   BatchWriter bw; // set up previously from a Connector
>   Combiner combiner = new SummingCombiner();
>   Map<String, String> combinerOptions = new HashMap<>();
>   combinerOptions.put("all", "true"); // or some other column subset option
>   bw.addCombiner(combiner, combinerOptions);
>   // or perhaps more generally/ambitiously:
>   // bw.addWriteIterator(combiner, combinerOptions);
>
>   // effect: the combiner is applied right before flushing data to the
>   // server; if the combiner throws an exception, the BatchWriter throws
>   // a MutationsRejectedException
>
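> For reference, here is roughly what such a combining layer has to do when
> built as a client-side wrapper today.  This is only a sketch: the class name
> and the BinaryOperator-based combine function are invented for illustration
> (and assume Java 8), and visibility, timestamp, and delete handling are
> simplified.  Note that it still keeps its own map alongside the
> BatchWriter's buffer, which is exactly the duplication at issue.
>
>   import java.util.Map;
>   import java.util.TreeMap;
>   import java.util.function.BinaryOperator;
>   import org.apache.accumulo.core.client.BatchWriter;
>   import org.apache.accumulo.core.client.MutationsRejectedException;
>   import org.apache.accumulo.core.data.ColumnUpdate;
>   import org.apache.accumulo.core.data.Key;
>   import org.apache.accumulo.core.data.Mutation;
>   import org.apache.accumulo.core.data.Value;
>
>   // hypothetical wrapper: pre-sums column updates in a local sorted map,
>   // then forwards the combined mutations to the wrapped BatchWriter
>   public class PreSummingBatchWriter implements BatchWriter {
>     private final BatchWriter delegate;
>     private final BinaryOperator<Value> combineFn;
>     private final int threshold;
>     private final Map<Key,Value> cache = new TreeMap<>();
>
>     public PreSummingBatchWriter(BatchWriter delegate,
>         BinaryOperator<Value> combineFn, int threshold) {
>       this.delegate = delegate;
>       this.combineFn = combineFn;
>       this.threshold = threshold;
>     }
>
>     @Override
>     public void addMutation(Mutation m) throws MutationsRejectedException {
>       for (ColumnUpdate cu : m.getUpdates()) {
>         // visibility/timestamp/delete handling simplified for the sketch
>         Key k = new Key(m.getRow(), cu.getColumnFamily(),
>             cu.getColumnQualifier(), cu.getColumnVisibility(),
>             cu.getTimestamp());
>         Value v = new Value(cu.getValue());
>         Value prev = cache.get(k);
>         cache.put(k, prev == null ? v : combineFn.apply(prev, v));
>       }
>       if (cache.size() >= threshold)
>         flush();
>     }
>
>     @Override
>     public void addMutations(Iterable<Mutation> muts)
>         throws MutationsRejectedException {
>       for (Mutation m : muts)
>         addMutation(m);
>     }
>
>     @Override
>     public void flush() throws MutationsRejectedException {
>       for (Map.Entry<Key,Value> e : cache.entrySet()) {
>         Key k = e.getKey();
>         Mutation m = new Mutation(k.getRow());
>         m.put(k.getColumnFamily(), k.getColumnQualifier(), e.getValue());
>         delegate.addMutation(m);
>       }
>       cache.clear();
>       delegate.flush();
>     }
>
>     @Override
>     public void close() throws MutationsRejectedException {
>       flush();
>       delegate.close();
>     }
>   }
>
> A caller would wrap an ordinary BatchWriter, e.g. new
> PreSummingBatchWriter(bw, combineFn, 100000), and write through it as usual.
>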
> Is there a better way to accomplish this without duplicating the
> BatchWriter's buffer?  Or would this make a nice addition to the API?  If I
> understand the BatchWriter correctly, it already sorts entries before
> sending them to the tablet server, because the tablet server can process
> them more efficiently that way.  If so, the overhead of adding a combining
> step after the sorting phase and before the network-transmit phase seems
> small, especially since combining reduces network traffic anyway.
>
> Regards,
> Dylan Hutchison
>
>
