incubator-accumulo-user mailing list archives

From Keith Massey <keith.mas...@digitalreasoning.com>
Subject Re: Scanning for rows using columnfamily only
Date Tue, 01 Nov 2011 20:11:56 GMT
Thanks for the tips. We tried using one locality group per column family 
(I think there are 20-25). It has definitely sped up queries for all 
data in a single column family. The first batch comes back in about 5 
seconds rather than 120 seconds without the locality groups. Our data 
load time doubled, though, from 7 hours to 14 hours. I don't have any 
evidence at this point that it is related to the locality groups. But 
there were very few differences between the 7-hour load and the 14-hour 
load. Any thoughts about whether this could be a side effect of loading 
data into 25 locality groups? Or am I looking in the wrong place?
Thanks again.

Keith
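
For reference, the one-group-per-column-family layout described above could be built roughly like this. It is only a sketch: the family names are made up, and LocalityGroupSketch and onePerFamily are hypothetical helpers, not part of Accumulo. In the Java client I believe the resulting map would be passed to TableOperations.setLocalityGroups (which expects Set<Text> rather than Set<String>); that call is left out so the snippet runs without Accumulo on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LocalityGroupSketch {

    // Build one locality group per column family, naming each group after its family.
    static Map<String, Set<String>> onePerFamily(List<String> families) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String family : families) {
            groups.put(family, Collections.singleton(family));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical family names; the table in this thread has 20-25 of them.
        List<String> families = Arrays.asList("entity", "relation", "document");
        Map<String, Set<String>> groups = onePerFamily(families);
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Grouping several families that are usually read together into one locality group would give fewer groups than this one-to-one mapping, which may matter for scans that do not fetch specific columns.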

On 10/26/11 6:48 PM, Keith Turner wrote:
> A few things to consider w/ these options.
>
> On Wed, Oct 26, 2011 at 4:13 PM, Adam Fuchs<adam.p.fuchs@ugov.gov>  wrote:
>> Hi Keith,
>> Sounds like you could use some locality groups! By default, Accumulo stores
> Consider the number of locality groups.  For example if you created
> 100 locality groups, then reading all of them is like reading from 100
> separate sections of a file at the same time and merging.  This could
> cause a lot of seeking.  You would read all of them when you do not
> fetch columns on the scanner.  Having a lot of locality groups may not
> be a problem, if you always fetch columns.  I have not tested w/ a
> large number of locality groups.
>
>> Another trick you can try is using a BatchScanner instead of a Scanner to
>> read from multiple nodes in parallel. The tradeoff here is you get better
>> query latency, but your key/value pairs are likely to come back out of
>> sorted order. This section of the user manual describes the BatchScanner:
>> http://incubator.apache.org/accumulo/user_manual_1.3-incubating/Writing_Accumulo_Clients.html#SECTION00520000000000000000
>>
> Using the batch scanner option will parallelize the filtering of data
> on the tablet server side.  So it may be faster, but a lot more work is
> being done.  This may be a good option if you do not use locality
> groups and do not need to run lots of them concurrently.  Lots of
> concurrent batch scanners could slow down query performance.  For
> example, creating 100 batch scanners w/ 20 threads each will attempt to
> start 2000 threads on the tablet servers to filter data.  Locality
> groups would be better for lots of concurrent scans.  The accumulo
> shell uses the batch scanner to implement grep.  A use case that is
> better suited to concurrent batch scanners is doing lots of small
> lookups.
>
> Keith
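
The 2000-thread figure in the quoted message is simply the number of concurrent batch scanners multiplied by the query threads each one is given. A small sketch of that arithmetic follows; BatchScanBudget and both methods are hypothetical names, and the thread cap passed to maxScanners is an assumed number for illustration, not an Accumulo setting.

```java
public class BatchScanBudget {

    // Threads the tablet servers are asked to start if every scanner runs at once.
    static int totalThreads(int scanners, int threadsPerScanner) {
        return scanners * threadsPerScanner;
    }

    // How many scanners of a given size fit under an assumed cap on server-side threads.
    static int maxScanners(int threadCap, int threadsPerScanner) {
        return threadCap / threadsPerScanner;
    }

    public static void main(String[] args) {
        // The example from the thread: 100 batch scanners with 20 threads each.
        System.out.println(totalThreads(100, 20));
        // With an assumed cap of 200 threads, how many 20-thread scanners fit.
        System.out.println(maxScanners(200, 20));
    }
}
```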
