accumulo-user mailing list archives

From Rob Povey <...@maana.io>
Subject Re: Is there a sensible way to do this? Sequential Batch Scanner
Date Wed, 28 Oct 2015 19:55:51 GMT
Thanks, I had thought about trying this, and it’s good to know it’s a viable solution.

However, I’m pretty reluctant right now to add any more iterators to our project; they’ve
been a test nightmare for us internally.
Because of the way our internal process works, at any point in time we have many versions
of our product running against a subset of tables in a single Accumulo instance, and at least
in 1.6 there doesn’t appear to be a good way to have the tablet servers automatically reload the
iterators when builds are updated (you can specify paths to watch, but it doesn’t seem to
handle wildcards). Our internal servers have literally hundreds of tables that require
different versions of iterators, so they live in different HDFS paths.

Thanks

Rob Povey


From: Dylan Hutchison <dhutchis@uw.edu>
Reply-To: user@accumulo.apache.org
Date: Tuesday, October 27, 2015 at 4:35 PM
To: Accumulo User List <user@accumulo.apache.org>
Subject: Re: Is there a sensible way to do this? Sequential Batch Scanner

Hi Rob,

One solution is to use an Accumulo iterator.  Suppose you want to scan a set of non-overlapping
ranges R.  Use a (non-batch) Scanner with a range spanning from the least start key in R to the
greatest end key in R, and a server-side iterator that works as follows:

  *   Pass R to the server-side iterator via iterator options.
  *   On a call to seek(Range r, ..., ...) in the iterator: let the iterator seek its parent
for the first range in R that intersects with r.
  *   On a call to next(), if the current seek'ed range is finished, seek its parent to the
next range in R that intersects with r, until no more ranges in R intersect with r.  At that
point the scan is finished.
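
The three steps above can be modelled in plain Java. This is only a sketch of the control flow, not Accumulo API code: a NavigableMap stands in for the parent iterator, String rows stand in for Keys, and RowRange, MultiRangeSeekSketch and scan are hypothetical names invented for the illustration. It also assumes R is already sorted, non-overlapping, and half-open [start, end).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

// Stand-alone model of "seek through a list of ranges"; NOT the Accumulo API.
final class MultiRangeSeekSketch {

    // Half-open row range [start, end); hypothetical stand-in for Accumulo's Range.
    static final class RowRange {
        final String start;
        final String end;
        RowRange(String start, String end) { this.start = start; this.end = end; }
    }

    private final NavigableMap<String, String> source; // plays the role of the parent iterator
    private final List<RowRange> ranges;               // R: sorted and non-overlapping (assumed)
    private RowRange seekRange;                         // r, the range handed to seek()
    private int rangeIdx;                               // index into R of the range being read
    private Iterator<Map.Entry<String, String>> current;
    private Map.Entry<String, String> top;

    MultiRangeSeekSketch(NavigableMap<String, String> source, List<RowRange> ranges) {
        this.source = source;
        this.ranges = ranges;
    }

    // seek(r): position on the first range in R that intersects r.
    void seek(RowRange r) {
        seekRange = r;
        rangeIdx = -1;
        current = null;
        advance();
    }

    boolean hasTop() { return top != null; }

    String getTopKey() { return top.getKey(); }

    // next(): keep reading the current range, or move on to the next one in R.
    void next() { advance(); }

    private void advance() {
        while (true) {
            if (current != null && current.hasNext()) {
                top = current.next();
                return;
            }
            rangeIdx++;
            if (rangeIdx >= ranges.size()) { // no more ranges in R: scan is finished
                top = null;
                return;
            }
            RowRange q = ranges.get(rangeIdx);
            // Clip the range from R to the seek range r.
            String lo = q.start.compareTo(seekRange.start) > 0 ? q.start : seekRange.start;
            String hi = q.end.compareTo(seekRange.end) < 0 ? q.end : seekRange.end;
            // An empty intersection leaves current null, so the loop skips to the next range.
            current = lo.compareTo(hi) < 0
                    ? source.subMap(lo, true, hi, false).entrySet().iterator()
                    : null;
        }
    }

    // Convenience driver: returns the keys the "scan" would produce, in order.
    static List<String> scan(NavigableMap<String, String> source, List<RowRange> ranges, RowRange r) {
        MultiRangeSeekSketch it = new MultiRangeSeekSketch(source, ranges);
        List<String> out = new ArrayList<>();
        for (it.seek(r); it.hasTop(); it.next()) {
            out.add(it.getTopKey());
        }
        return out;
    }
}
```

A real iterator would additionally have to parse R out of the iterator options and cope with being re-seeked part-way through R when the tablet server tears the scan down and rebuilds it; the clipping against r in advance() is what would make that case work.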

The result is that you can scan a number of disjoint ranges with "one Scanner call" whose
results come back in order.  In effect, we moved seek control into the land of iterators.
One word of caution: if the number of ranges is very large, you might run into ACCUMULO-3710
(https://issues.apache.org/jira/browse/ACCUMULO-3710): too many range objects get materialized
at the tablet server, which results in an out-of-memory error.

I have implemented something like this in the Graphulo project under SeekFilterIterator
(https://github.com/Accla/graphulo/blob/master/src/main/java/edu/mit/ll/graphulo/skvi/SeekFilterIterator.java)
and its related classes.  Take a look at that if you want to try this idea, and feel free
to follow up with questions.

Cheers, Dylan




On Tue, Oct 27, 2015 at 3:21 PM, Rob Povey <rob@maana.io> wrote:
What I want is something that behaves like a BatchScanner (i.e., takes a collection of Ranges
in a single RPC) but preserves the scan ordering.
I understand this would greatly impact performance, but in my case I can manually partition
my request on the client and send one request per tablet.
I can’t use plain Scanners, because in some cases I have tens of thousands of non-consecutive
ranges.
If I use a single-threaded BatchScanner and only request data from a single tablet, am I
guaranteed ordering?
This appears to work correctly in my small tests (albeit slower than a single one-thread
BatchScanner call), but I don’t really want to have to rely on it if the semantics aren’t guaranteed.
If not, is there another “efficient” way to do this?
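
For reference, the client-side partitioning mentioned above could look roughly like the sketch below. It is plain Java rather than Accumulo API code: rows are Strings, ranges are half-open [start, end), the split list is assumed to be sorted (in practice it might come from something like TableOperations.listSplits), and RangePartitionSketch, RowRange and partition are hypothetical names used only for this illustration. It relies on a tablet that ends at split row s containing s itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Stand-alone model of grouping ranges per tablet; NOT the Accumulo API.
final class RangePartitionSketch {

    // Half-open row range [start, end); hypothetical stand-in for Accumulo's Range.
    static final class RowRange {
        final String start;
        final String end;
        RowRange(String start, String end) { this.start = start; this.end = end; }
    }

    // Groups ranges by tablet index (0 = first tablet), clipping any range that
    // crosses a split so that every piece lies inside exactly one tablet.
    static SortedMap<Integer, List<RowRange>> partition(List<String> sortedSplits,
                                                        List<RowRange> ranges) {
        // A tablet ending at split s includes row s, so in half-open terms its
        // exclusive upper bound is the smallest string after s, i.e. s + "\0".
        List<String> bounds = new ArrayList<>();
        for (String s : sortedSplits) {
            bounds.add(s + "\0");
        }
        SortedMap<Integer, List<RowRange>> byTablet = new TreeMap<>();
        for (RowRange r : ranges) {
            String lo = r.start;
            // Find the first tablet whose upper bound lies above the range start.
            int t = 0;
            while (t < bounds.size() && bounds.get(t).compareTo(lo) <= 0) {
                t++;
            }
            // Emit one clipped piece per tablet the range touches.
            while (lo.compareTo(r.end) < 0) {
                String hi = (t < bounds.size() && bounds.get(t).compareTo(r.end) < 0)
                        ? bounds.get(t)
                        : r.end;
                byTablet.computeIfAbsent(t, k -> new ArrayList<>()).add(new RowRange(lo, hi));
                lo = hi;
                t++;
            }
        }
        return byTablet;
    }
}
```

Each map entry would then become one request. Splits can change between computing the partition and issuing the requests, so this grouping is at best a hint, not a guarantee that a request touches a single tablet.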

Thanks

Rob Povey

