accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Accumulo Seek performance
Date Wed, 24 Aug 2016 14:33:37 GMT
This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710

I don't feel like 3000 ranges is too many, but this isn't quantitative.

IIRC, the BatchScanner will take each Range you provide, bin each Range 
to the TabletServer(s) currently hosting the corresponding data, clip 
(truncate) each Range to match the Tablet boundaries, and then do an 
RPC to each TabletServer with just the Ranges hosted there.
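
For illustration, here's a minimal sketch of what that clipping step 
means, using the public org.apache.accumulo.core.data.Range#clip API 
rather than the tserver internals (the row values are made up):

     import org.apache.accumulo.core.data.Range

     val provided = new Range("doc100", "doc500") // a client-supplied Range
     val tablet   = new Range("doc300", "doc999") // pretend extent of one Tablet
     // clip returns the overlap of the two, here [doc300, doc500]; only
     // that piece is sent to the TabletServer hosting this Tablet
     val clipped  = provided.clip(tablet)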

Inside the TabletServer, it will then have many Ranges, binned by Tablet 
(KeyExtent, to be precise). This will spawn an 
org.apache.accumulo.tserver.scan.LookupTask which will start collecting 
results to send back to the client.

The caveat here is that those ranges are processed serially on a 
TabletServer. Maybe you're swamping one TabletServer with lots of 
Ranges that it could be processing in parallel.

Could you experiment with using multiple BatchScanners and something 
like Guava's Iterables.concat to make it appear like one Iterator?
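
Something like this rough sketch, for example. I'm reusing instance, 
auths, ARTIFACTS, and ranges from your snippet below, assuming ranges 
is a Scala collection; the partition size of 500 is arbitrary:

     import scala.collection.JavaConverters._
     import com.google.common.collect.Iterables

     // Split the ~3000 Ranges across several BatchScanners so a single
     // TabletServer can work on several batches concurrently.
     val scanners = ranges.grouped(500).map { part =>
       val bs = instance.createBatchScanner(ARTIFACTS, auths, 10)
       bs.setRanges(part.asJava)
       bs
     }.toList

     // Guava stitches the scanners together into one Iterable of entries.
     val combined = Iterables.concat(scanners.asJava)
     for (entry <- combined.asScala) {
       // entry.getKey() / entry.getValue(), as before
     }

     scanners.foreach(_.close())

Each BatchScanner still does its own binning, so this only helps if the 
extra scanners give a busy TabletServer more sessions to work through in 
parallel.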

I'm curious if we should put an optimization into the BatchScanner 
itself to limit the number of ranges we send in one RPC to a 
TabletServer (e.g. one BatchScanner might open multiple 
MultiScanSessions to a TabletServer).

Sven Hodapp wrote:
> Hi there,
>
> currently we're experimenting with a two-node Accumulo cluster (two tablet servers) set up for document storage.
> These documents are decomposed down to the sentence level.
>
> Now I'm using a BatchScanner to assemble the full document like this:
>
>      val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10)
>      // ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>      bscan.setRanges(ranges)  // there are like 3000 Range.exact's in the ranges-list
>      for (entry <- bscan.asScala) yield {
>        val key = entry.getKey()
>        val value = entry.getValue()
>        // etc.
>      }
>
> For larger full documents (e.g. 3000 exact ranges), this operation will take about 12 seconds.
> But shorter documents are assembled blazing fast...
>
> Is that too much for a BatchScanner / am I misusing the BatchScanner?
> Is that a normal time for such a (seek) operation?
> Can I do something to get better seek performance?
>
> Note: I have already enabled bloom filtering on that table.
>
> Thank you for any advice!
>
> Regards,
> Sven
>
