accumulo-user mailing list archives

From Adam Fuchs <afu...@apache.org>
Subject Re: Is there a sensible way to do this? Sequential Batch Scanner
Date Wed, 28 Oct 2015 20:48:30 GMT
Rob,

I would use something like an IteratorChain [1] and feed it
Scanner.iterator() objects. If you setReadaheadThreshold(0) on the scanner
then calling Scanner.iterator() is a fairly lightweight operation, and
you'll be able to plop a bunch of iterators into the IteratorChain so that
they are dynamically activated when you're ready for them. If you want
higher throughput you will have to do something tricky with readahead
thresholds, like writing your own iterator chain and reading ahead on only
a few ScannerIterators at a time. You might not need that to get good
enough performance, though.

[1]
https://commons.apache.org/proper/commons-collections/javadocs/api-2.1.1/org/apache/commons/collections/iterators/IteratorChain.html
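
A minimal sketch of the "write your own iterator chain" idea, using only java.util so it runs without Accumulo on the classpath. The class name LazyIteratorChain and the Supplier-based constructor are illustrative assumptions, not part of Accumulo or Commons Collections. Each Supplier stands in for code that would create a Scanner for one Range, call setReadaheadThreshold(0) on it, and return Scanner.iterator(). Nothing is opened until the previous source is drained, so results come back in the order the sources were supplied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/** Chains iterators end to end, opening each source only when the previous one is drained. */
public class LazyIteratorChain<T> implements Iterator<T> {
    // Hypothetical stand-in: in real use each Supplier would build a Scanner
    // for one Range and return its iterator().
    private final Iterator<Supplier<Iterator<T>>> sources;
    private Iterator<T> current = null;

    public LazyIteratorChain(Iterable<Supplier<Iterator<T>>> sources) {
        this.sources = sources.iterator();
    }

    @Override
    public boolean hasNext() {
        // Skip exhausted or empty sources; each get() opens the next one on demand.
        while (current == null || !current.hasNext()) {
            if (!sources.hasNext()) {
                return false;
            }
            current = sources.next().get();
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    public static void main(String[] args) {
        // In-memory lists play the role of per-range scan results.
        List<Supplier<Iterator<String>>> perRange = new ArrayList<>();
        perRange.add(() -> Arrays.asList("a1", "a2").iterator());
        perRange.add(() -> new ArrayList<String>().iterator()); // a range with no hits
        perRange.add(() -> Arrays.asList("c1").iterator());

        Iterator<String> chain = new LazyIteratorChain<>(perRange);
        while (chain.hasNext()) {
            System.out.println(chain.next());
        }
    }
}
```

Ordering across the whole result only holds if the ranges are sorted and non-overlapping before they are handed to the chain. Reading ahead on a few sources at a time would mean calling get() on the next few Suppliers early, which this sketch deliberately does not do.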

Adam

On Wed, Oct 28, 2015 at 4:00 PM, Rob Povey <rob@maana.io> wrote:

> Unfortunately that’s pretty much what I’m doing now, and the results are
> large enough that pulling them back and sorting them causes fairly dramatic
> GC issues.
> If I could get them in sorted order I no longer need to retain them, I can
> just process them and discard them eliminating my GC issues.
> I think the way I’ll end up working around this in the short term is to
> pull pages of data from a batch scanner, sort those, then combine the paged
> results. That should be manageable.
>
> Rob Povey
>
> From: Keith Turner <keith@deenlo.com>
> Reply-To: "user@accumulo.apache.org" <user@accumulo.apache.org>
> Date: Wednesday, October 28, 2015 at 8:04 AM
> To: "user@accumulo.apache.org" <user@accumulo.apache.org>
> Subject: Re: Is there a sensible way to do this? Sequential Batch Scanner
>
> Will the results always fit into memory?  If so, you could put the results
> from the batch scanner into an ArrayList and sort it.
>
> On Tue, Oct 27, 2015 at 6:21 PM, Rob Povey <rob@maana.io> wrote:
>
>> What I want is something that behaves like a BatchScanner (i.e. takes a
>> collection of Ranges in a single RPC), but preserves the scan ordering.
>> I understand this would greatly impact performance, but in my case I can
>> manually partition my request on the client, and send one request per
>> tablet.
>> I can’t use scanners, because in some cases I have tens of thousands of
>> non-consecutive ranges.
>> If I use a single threaded BatchScanner, and only request data from a
>> single Tablet, am I guaranteed ordering?
>> This appears to work correctly in my small tests (albeit slower than a
>> single 1 thread Batch scanner call), but I don’t really want to have to
>> rely on it if the semantic isn’t guaranteed.
>> If not, is there another “efficient” way to do this?
>>
>> Thanks
>>
>> Rob Povey
>>
>>
>
