accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Accumulo Seek performance
Date Wed, 24 Aug 2016 16:36:42 GMT
Ahh duh. Bad advice from me in the first place :)

Throw 'em in a threadpool locally.
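
For what it's worth, "throw 'em in a threadpool" could look roughly like the sketch below. The `parallelScan` helper is made up for illustration (it is not an Accumulo API); it just fans the groups of Ranges out over a fixed-size pool and concatenates the results. The body of your existing per-group loop (create the BatchScanner, setRanges, drain, close) would go in `scanGroup`:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical helper: run scanGroup over each group of ranges in parallel
// on a fixed-size thread pool, then flatten all the results together.
def parallelScan[R, K](rangeGroups: Seq[Seq[R]], threads: Int)
                      (scanGroup: Seq[R] => Seq[K]): Seq[K] = {
  val pool = Executors.newFixedThreadPool(threads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    // One Future per group of ranges; each runs on the pool.
    val futures = rangeGroups.map(group => Future(scanGroup(group)))
    Await.result(Future.sequence(futures), Duration.Inf).flatten
  } finally pool.shutdown()
}
```

One thing to watch: drain the scanner inside the task (materialize the keys into a collection) and close it before returning, so the lazy iterator doesn't escape the thread that owns the scanner.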

dlmarion@comcast.net wrote:
> Doesn't this use the 6 batch scanners serially?
>
> ------------------------------------------------------------------------
> *From: *"Sven Hodapp" <sven.hodapp@scai.fraunhofer.de>
> *To: *"user" <user@accumulo.apache.org>
> *Sent: *Wednesday, August 24, 2016 11:56:14 AM
> *Subject: *Re: Accumulo Seek performance
>
> Hi Josh,
>
> thanks for your reply!
>
> I've tested your suggestion with an implementation like this:
>
> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
>
> time("mult-scanner") {
>   for (ranges <- ranges500) {
>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
>     bscan.setRanges(ranges.asJava)
>     for (entry <- bscan.asScala) yield {
>       entry.getKey()
>     }
>   }
> }
>
> And the result is a bit disappointing:
>
> background log: info: mult-scanner time: 18064.969281 ms
> background log: info: single-scanner time: 6527.482383 ms
>
> Am I doing something wrong here?
>
>
> Regards,
> Sven
>
> --
> Sven Hodapp, M.Sc.,
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> Department of Bioinformatics
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> sven.hodapp@scai.fraunhofer.de
> www.scai.fraunhofer.de
>
> ----- Original Message -----
>  > From: "Josh Elser" <josh.elser@gmail.com>
>  > To: "user" <user@accumulo.apache.org>
>  > Sent: Wednesday, 24 August 2016 16:33:37
>  > Subject: Re: Accumulo Seek performance
>
>  > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710
>  >
>  > I don't feel like 3000 ranges is too many, but this isn't quantitative.
>  >
>  > IIRC, the BatchScanner will take each Range you provide, bin each Range
>  > to the TabletServer(s) currently hosting the corresponding data, clip
>  > (truncate) each Range to match the Tablet boundaries, and then do one
>  > RPC to each TabletServer with just the Ranges hosted there.
>  >
>  > Inside the TabletServer, it will then have many Ranges, binned by Tablet
>  > (KeyExtent, to be precise). This will spawn an
>  > org.apache.accumulo.tserver.scan.LookupTask which will start collecting
>  > results to send back to the client.
>  >
>  > The caveat here is that those ranges are processed serially on a
>  > TabletServer. Maybe, you're swamping one TabletServer with lots of
>  > Ranges that it could be processing in parallel.
>  >
>  > Could you experiment with using multiple BatchScanners and something
>  > like Guava's Iterables.concat to make it appear like one Iterator?
>  >
>  > I'm curious if we should put an optimization into the BatchScanner
>  > itself to limit the number of ranges we send in one RPC to a
>  > TabletServer (e.g. one BatchScanner might open multiple
>  > MultiScanSessions to a TabletServer).
>  >
>  > Sven Hodapp wrote:
>  >> Hi there,
>  >>
>  >> currently we're experimenting with a two-node Accumulo cluster (two
>  >> tablet servers) set up for document storage.
>  >> These documents are decomposed down to the sentence level.
>  >>
>  >> Now I'm using a BatchScanner to assemble the full document like this:
>  >>
>  >> // ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>  >> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10)
>  >> // there are about 3000 Range.exact's in the ranges-list
>  >> bscan.setRanges(ranges)
>  >> for (entry <- bscan.asScala) yield {
>  >>   val key = entry.getKey()
>  >>   val value = entry.getValue()
>  >>   // etc.
>  >> }
>  >>
>  >> For larger full documents (e.g. 3000 exact ranges), this operation
>  >> takes about 12 seconds.
>  >> But shorter documents are assembled blazingly fast...
>  >>
>  >> Is that too much for a BatchScanner / am I misusing the BatchScanner?
>  >> Is that a normal time for such a (seek) operation?
>  >> Can I do something to get a better seek performance?
>  >>
>  >> Note: I have already enabled bloom filtering on that table.
>  >>
>  >> Thank you for any advice!
>  >>
>  >> Regards,
>  >> Sven
>
