accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@uw.edu>
Subject Re: Is there a sensible way to do this? Sequential Batch Scanner
Date Tue, 27 Oct 2015 23:35:44 GMT
Hi Rob,

One solution is to use an Accumulo iterator.  Suppose you want to scan a
set of non-overlapping ranges R.  Use a (non-batch) Scanner with a range
spanning the least start key in R to the greatest end key in R, and a
server-side iterator that works as follows:

   - Pass R to the server-side iterator via iterator options.
   - On a call to seek(Range r, ..., ...) in the iterator: let the iterator
   seek its parent for the first range in R that intersects with r.
   - On a call to next(), if the current seek'ed range is finished, seek
   its parent to the next range in R that intersects with r, until no more
   ranges in R intersect with r.  At that point the scan is finished.
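
To make the seek/next steps above concrete, here is a rough, self-contained
sketch of the control flow.  It deliberately does not use the Accumulo API: a
TreeMap stands in for the parent (source) iterator, and KeyRange,
MultiRangeIterator and scanAll are hypothetical names invented for this
illustration.  A real implementation would be an Accumulo server-side iterator
that receives R through its iterator options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class MultiRangeScanSketch {

    /** Hypothetical half-open key range [start, end); a stand-in for Accumulo's Range. */
    static final class KeyRange {
        final String start;
        final String end;

        KeyRange(String start, String end) {
            this.start = start;
            this.end = end;
        }

        boolean isEmpty() {
            return start.compareTo(end) >= 0;
        }

        /** Intersection of this range with another; the result may be empty. */
        KeyRange clip(KeyRange other) {
            String s = start.compareTo(other.start) >= 0 ? start : other.start;
            String e = end.compareTo(other.end) <= 0 ? end : other.end;
            return new KeyRange(s, e);
        }
    }

    /** Stand-in for the server-side iterator; the sorted map plays the parent (source). */
    static final class MultiRangeIterator {
        private final NavigableMap<String, String> parent;
        private final List<KeyRange> ranges; // R: sorted by start key, non-overlapping
        private KeyRange seekRange;
        private int nextRangeIdx;
        private Iterator<String> current = Collections.emptyIterator();
        private String topKey; // null once the scan is finished

        MultiRangeIterator(NavigableMap<String, String> parent, List<KeyRange> ranges) {
            this.parent = parent;
            this.ranges = ranges;
        }

        /** seek(r): restart from the first range in R that intersects r. */
        void seek(KeyRange r) {
            seekRange = r;
            nextRangeIdx = 0;
            current = Collections.emptyIterator();
            advance();
        }

        boolean hasTop() { return topKey != null; }
        String getTopKey() { return topKey; }
        void next() { advance(); }

        private void advance() {
            // When the current range is exhausted, "seek the parent" to the next
            // range in R that intersects the seek range.
            while (!current.hasNext() && nextRangeIdx < ranges.size()) {
                KeyRange clipped = ranges.get(nextRangeIdx++).clip(seekRange);
                if (!clipped.isEmpty()) {
                    current = parent.subMap(clipped.start, true, clipped.end, false)
                                    .keySet().iterator();
                }
            }
            topKey = current.hasNext() ? current.next() : null;
        }
    }

    /** Client side: one scan over the span from the least start key to the greatest end key. */
    static List<String> scanAll(NavigableMap<String, String> table, List<KeyRange> ranges) {
        KeyRange span = new KeyRange(ranges.get(0).start, ranges.get(ranges.size() - 1).end);
        MultiRangeIterator it = new MultiRangeIterator(table, ranges);
        it.seek(span);
        List<String> keys = new ArrayList<>();
        while (it.hasTop()) {
            keys.add(it.getTopKey());
            it.next();
        }
        return keys;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String k : new String[] {"a", "b", "c", "d", "e", "f", "g"}) {
            table.put(k, "v");
        }
        List<KeyRange> r = Arrays.asList(new KeyRange("b", "d"), new KeyRange("f", "h"));
        System.out.println(scanAll(table, r)); // prints [b, c, f, g]
    }
}
```

The sketch restarts from the first range in R on every seek, which is simple
but linear in the size of R; with many ranges a binary search for the first
intersecting range would likely be preferable.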

The result is that you can scan a number of non-overlapping ranges with "one
Scanner call" whose results come back in order.  We did this by moving seek
control into the land of iterators.  One word of caution: if the number of
ranges is very large, you might run into ACCUMULO-3710
<https://issues.apache.org/jira/browse/ACCUMULO-3710> -- too many range
objects get materialized at the tablet server, which results in an
out-of-memory error.

I have implemented something like this in the Graphulo project under
SeekFilterIterator
<https://github.com/Accla/graphulo/blob/master/src/main/java/edu/mit/ll/graphulo/skvi/SeekFilterIterator.java>
and its related classes.  Take a look at that if you want to try this idea,
and feel free to follow up with questions.

Cheers, Dylan




On Tue, Oct 27, 2015 at 3:21 PM, Rob Povey <rob@maana.io> wrote:

> What I want is something that behaves like a BatchScanner (i.e., takes a
> collection of Ranges in a single RPC) but preserves the scan ordering.
> I understand this would greatly impact performance, but in my case I can
> manually partition my request on the client and send one request per
> tablet.
> I can’t use Scanners, because in some cases I have tens of thousands of
> non-consecutive ranges.
> If I use a single-threaded BatchScanner and only request data from a
> single tablet, am I guaranteed ordering?
> This appears to work correctly in my small tests (albeit slower than a
> single 1 thread Batch scanner call), but I don’t really want to have to
> rely on it if the semantics aren’t guaranteed.
> If not, is there another “efficient” way to do this?
>
> Thanks
>
> Rob Povey
>
>
