accumulo-user mailing list archives

From dlmar...@comcast.net
Subject Re: Accumulo Seek performance
Date Thu, 25 Aug 2016 14:22:35 GMT
But does toList exhaust the first iterator() before going to the next? 

- Dave 


----- Original Message -----

From: "Sven Hodapp" <sven.hodapp@scai.fraunhofer.de> 
To: "user" <user@accumulo.apache.org> 
Sent: Thursday, August 25, 2016 9:42:00 AM 
Subject: Re: Accumulo Seek performance 

Hi dlmarion, 

toList should also call iterator(), and that is done independently for each batch scanner
iterator in the context of the Future. 

Regards, 
Sven 

-- 
Sven Hodapp, M.Sc., 
Fraunhofer Institute for Algorithms and Scientific Computing SCAI, 
Department of Bioinformatics 
Schloss Birlinghoven, 53754 Sankt Augustin, Germany 
sven.hodapp@scai.fraunhofer.de 
www.scai.fraunhofer.de 

----- Original Message ----- 
> From: dlmarion@comcast.net 
> To: "user" <user@accumulo.apache.org> 
> Sent: Thursday, August 25, 2016 14:34:39 
> Subject: Re: Accumulo Seek performance 

> Calling BatchScanner.iterator() is what starts the work on the server side. You 
> should do this first for all 6 batch scanners, then iterate over all of them in 
> parallel. 
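A minimal sketch of this pattern, with plain in-memory Iterators standing in for real BatchScanners (with real scanners, obtaining each iterator up front is what kicks off the server-side work; the names here are illustrative, not Accumulo API):

```scala
import java.util.concurrent.{Callable, Executors}

// Sketch of "start all scanners first, then drain them in parallel".
// Plain Iterators model BatchScanner.iterator(); each one is consumed
// on its own thread so no scanner blocks the others.
object ParallelDrain {
  def drainAll[A](iterators: Seq[Iterator[A]]): Seq[List[A]] = {
    val pool = Executors.newFixedThreadPool(math.max(1, iterators.size))
    try {
      val futures = iterators.map { it =>
        pool.submit(new Callable[List[A]] {
          def call(): List[A] = it.toList // drain this iterator on its own thread
        })
      }
      futures.map(_.get()) // wait for every drain to finish
    } finally {
      pool.shutdown()
    }
  }
}
```

The key point is that all iterators exist before any of them is consumed, so the server-side work for all scanners can overlap.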
> 
> ----- Original Message ----- 
> 
> From: "Sven Hodapp" <sven.hodapp@scai.fraunhofer.de> 
> To: "user" <user@accumulo.apache.org> 
> Sent: Thursday, August 25, 2016 4:53:41 AM 
> Subject: Re: Accumulo Seek performance 
> 
> Hi, 
> 
> I've changed the code a little bit, so that it uses a thread pool (via the 
> Future): 
> 
> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created 
> 
> for (ranges <- ranges500) { 
>   val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2) 
>   bscan.setRanges(ranges.asJava) 
>   Future { 
>     time("mult-scanner") { 
>       bscan.asScala.toList // toList forces the iteration of the iterator 
>     } 
>   } 
> } 
> 
> Here are the results: 
> 
> background log: info: mult-scanner time: 4807.289358 ms 
> background log: info: mult-scanner time: 4930.996522 ms 
> background log: info: mult-scanner time: 9510.010808 ms 
> background log: info: mult-scanner time: 11394.152391 ms 
> background log: info: mult-scanner time: 13297.247295 ms 
> background log: info: mult-scanner time: 14032.704837 ms 
> 
> background log: info: single-scanner time: 15322.624393 ms 
> 
> Every Future completes independently, but in return each batch scanner iterator 
> needs more time to complete. :( 
> Does this mean the batch scanners aren't really processed in parallel on the server 
> side? 
> Should I reconfigure something? Maybe the tablet servers haven't/can't allocate 
> enough threads or memory? (Each of the two nodes has 8 cores and 64GB memory 
> and storage with ~300MB/s...) 
> 
> Regards, 
> Sven 
> 
> -- 
> Sven Hodapp, M.Sc., 
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI, 
> Department of Bioinformatics 
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany 
> sven.hodapp@scai.fraunhofer.de 
> www.scai.fraunhofer.de 
> 
> ----- Original Message ----- 
>> From: "Josh Elser" <josh.elser@gmail.com> 
>> To: "user" <user@accumulo.apache.org> 
>> Sent: Wednesday, August 24, 2016 18:36:42 
>> Subject: Re: Accumulo Seek performance 
> 
>> Ahh duh. Bad advice from me in the first place :) 
>> 
>> Throw 'em in a threadpool locally. 
>> 
>> dlmarion@comcast.net wrote: 
>>> Doesn't this use the 6 batch scanners serially? 
>>> 
>>> ------------------------------------------------------------------------ 
>>> *From: *"Sven Hodapp" <sven.hodapp@scai.fraunhofer.de> 
>>> *To: *"user" <user@accumulo.apache.org> 
>>> *Sent: *Wednesday, August 24, 2016 11:56:14 AM 
>>> *Subject: *Re: Accumulo Seek performance 
>>> 
>>> Hi Josh, 
>>> 
>>> thanks for your reply! 
>>> 
>>> I've tested your suggestion with an implementation like this: 
>>> 
>>> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created 
>>> 
>>> time("mult-scanner") { 
>>>   for (ranges <- ranges500) { 
>>>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1) 
>>>     bscan.setRanges(ranges.asJava) 
>>>     for (entry <- bscan.asScala) yield { 
>>>       entry.getKey() 
>>>     } 
>>>   } 
>>> } 
>>> 
>>> And the result is a bit disappointing: 
>>> 
>>> background log: info: mult-scanner time: 18064.969281 ms 
>>> background log: info: single-scanner time: 6527.482383 ms 
>>> 
>>> Am I doing something wrong here? 
>>> 
>>> 
>>> Regards, 
>>> Sven 
>>> 
>>> -- 
>>> Sven Hodapp, M.Sc., 
>>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI, 
>>> Department of Bioinformatics 
>>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany 
>>> sven.hodapp@scai.fraunhofer.de 
>>> www.scai.fraunhofer.de 
>>> 
>>> ----- Original Message ----- 
>>> > From: "Josh Elser" <josh.elser@gmail.com> 
>>> > To: "user" <user@accumulo.apache.org> 
>>> > Sent: Wednesday, August 24, 2016 16:33:37 
>>> > Subject: Re: Accumulo Seek performance 
>>> 
>>> > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710 
>>> > 
>>> > I don't feel like 3000 ranges is too many, but this isn't quantitative. 
>>> > 
>>> > IIRC, the BatchScanner will take each Range you provide, bin each Range 
>>> > to the TabletServer(s) currently hosting the corresponding data, clip 
>>> > (truncate) each Range to match the Tablet boundaries, and then do an 
>>> > RPC to each TabletServer with just the Ranges hosted there. 
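The clipping step could be sketched roughly like this (illustrative only, not Accumulo's actual internals; ranges and tablets are both modeled as half-open string intervals):

```scala
// Hypothetical model: a Span is a half-open interval [lo, hi) over row keys.
case class Span(lo: String, hi: String)

object RangeClip {
  // Truncate `range` to `tablet`'s boundaries; None if they don't overlap.
  def clip(range: Span, tablet: Span): Option[Span] = {
    val lo = if (range.lo > tablet.lo) range.lo else tablet.lo
    val hi = if (range.hi < tablet.hi) range.hi else tablet.hi
    if (lo < hi) Some(Span(lo, hi)) else None
  }
}
```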
>>> > 
>>> > Inside the TabletServer, it will then have many Ranges, binned by Tablet 
>>> > (KeyExtent, to be precise). This will spawn an 
>>> > org.apache.accumulo.tserver.scan.LookupTask, which will start collecting 
>>> > results to send back to the client. 
>>> > 
>>> > The caveat here is that those Ranges are processed serially on a 
>>> > TabletServer. Maybe you're swamping one TabletServer with lots of 
>>> > Ranges that it could be processing in parallel. 
>>> > 
>>> > Could you experiment with using multiple BatchScanners and something 
>>> > like Guava's Iterables.concat to make it appear like one Iterator? 
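For what it's worth, the concat idea could look roughly like this in Scala, using the standard library's flatMap as a stand-in for Guava's Iterables.concat (plain Iterators model the scanners; note the combined iterator is still consumed serially, so the scanners must already be started for the server-side work to overlap):

```scala
// Sketch: present several scanner iterators as one lazy Iterator,
// analogous to Guava's Iterables.concat.
object ConcatScanners {
  def concat[A](parts: Seq[Iterator[A]]): Iterator[A] =
    parts.iterator.flatMap(identity)
}
```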
>>> > 
>>> > I'm curious if we should put an optimization into the BatchScanner 
>>> > itself to limit the number of ranges we send in one RPC to a 
>>> > TabletServer (e.g. one BatchScanner might open multiple 
>>> > MultiScanSessions to a TabletServer). 
>>> > 
>>> > Sven Hodapp wrote: 
>>> >> Hi there, 
>>> >> 
>>> >> currently we're experimenting with a two-node Accumulo cluster (two tablet 
>>> >> servers) set up for document storage. 
>>> >> These documents are decomposed down to the sentence level. 
>>> >> 
>>> >> Now I'm using a BatchScanner to assemble the full document like this: 
>>> >> 
>>> >> // ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets 
>>> >> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) 
>>> >> bscan.setRanges(ranges) // there are like 3000 Range.exact's in the ranges-list 
>>> >> for (entry <- bscan.asScala) yield { 
>>> >>   val key = entry.getKey() 
>>> >>   val value = entry.getValue() 
>>> >>   // etc. 
>>> >> } 
>>> >> 
>>> >> For larger full documents (e.g. 3000 exact ranges), this operation will take 
>>> >> about 12 seconds. 
>>> >> But shorter documents are assembled blazingly fast... 
>>> >> 
>>> >> Is that too much for a BatchScanner / am I misusing the BatchScanner? 
>>> >> Is that a normal time for such a (seek) operation? 
>>> >> Can I do something to get better seek performance? 
>>> >> 
>>> >> Note: I have already enabled bloom filtering on that table. 
>>> >> 
>>> >> Thank you for any advice! 
>>> >> 
>>> >> Regards, 
>>> >> Sven 

