incubator-accumulo-user mailing list archives

From Keith Turner
Subject Re: Scanning for rows using columnfamily only
Date Tue, 01 Nov 2011 21:15:35 GMT
On Tue, Nov 1, 2011 at 4:11 PM, Keith Massey wrote:
> Thanks for the tips. We tried using one locality group per column family (I
> think there are 20-25). It has definitely sped up queries for all data in a
> single column family. The first batch comes back in about 5 seconds rather
> than 120 seconds without the locality groups. Our data load time doubled
> though from 7 hours to 14 hours. I don't have any evidence at this point
> that it is related to the locality groups. But there were very few
> differences between the 7-hour load and the 14-hour load. Any thoughts about
> whether this could be a side effect of loading data into 25 locality groups?
> Or am I looking in the wrong place?
> Thanks again.
> Keith

One experiment worth trying is to put multiple column families in
a single locality group.  For example, how does putting two column
families in each locality group affect ingest performance and scan
performance?  Then try 4, 8, etc.

I may try to run some experiments w/ this also.
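
To make the experiment concrete, here is a minimal sketch of how the
groupings could be generated.  The class name, the "lg" group names and
the "cf" family names are made up for illustration.  The resulting map
would still need its family names converted to Hadoop Text objects
before being handed to something like
TableOperations.setLocalityGroups(), which I am assuming is how the
groups get applied.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of Accumulo): assigns column families to
// locality groups so that each group holds at most familiesPerGroup families.
class LocalityGroupPlan {

    static Map<String, Set<String>> plan(List<String> families, int familiesPerGroup) {
        if (familiesPerGroup < 1) {
            throw new IllegalArgumentException("familiesPerGroup must be >= 1");
        }
        // LinkedHashMap/LinkedHashSet keep the groups and families in input order.
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (int i = 0; i < families.size(); i++) {
            // Families 0..n-1 go to lg0, n..2n-1 to lg1, and so on.
            String groupName = "lg" + (i / familiesPerGroup);
            groups.computeIfAbsent(groupName, k -> new LinkedHashSet<>())
                  .add(families.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed table layout: 24 column families named cf0..cf23.
        List<String> families = new ArrayList<>();
        for (int i = 0; i < 24; i++) {
            families.add("cf" + i);
        }
        // The group sizes suggested above: 1, 2, 4, 8 families per group.
        for (int size : new int[] {1, 2, 4, 8}) {
            Map<String, Set<String>> groups = plan(families, size);
            System.out.println(size + " per group -> " + groups.size() + " locality groups");
        }
    }
}
```

Each column family lands in exactly one group, so ingest and scan times
can be compared across runs that differ only in group size.
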
> On 10/26/11 6:48 PM, Keith Turner wrote:
>> A few things to consider w/ these options.
>> On Wed, Oct 26, 2011 at 4:13 PM, Adam Fuchs wrote:
>>> Hi Keith,
>>> Sounds like you could use some locality groups! By default, Accumulo
>>> stores
>> Consider the number of locality groups.  For example, if you created
>> 100 locality groups, then reading all of them is like reading from 100
>> separate sections of a file at the same time and merging.  This could
>> cause a lot of seeking.  You would read all of them when you do not
>> fetch columns on the scanner.  Having a lot of locality groups may not
>> be a problem if you always fetch columns.  I have not tested w/ a
>> large number of locality groups.
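
The "reading from many sections and merging" idea can be sketched as a
k-way merge over sorted lists.  This is only an illustration of the
concept, not Accumulo's actual file-reading code.  The class and method
names are invented, and plain strings stand in for keys.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: merges several individually sorted key lists (one per
// locality group) into one sorted stream, the way a scan that fetches no
// specific columns has to draw from every group at once.
class GroupMerge {

    // One cursor per group: the current key plus the rest of that group's keys.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    static List<String> merge(List<List<String>> sortedGroups) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> group : sortedGroups) {
            Iterator<String> it = group.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it)); // one open "section" per non-empty group
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advancing a cursor corresponds to reading further in that section.
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }
}
```

With 100 groups the heap holds up to 100 cursors, and consecutive keys
may come from different sections.  That is where the extra seeking
would come from.
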
>>> Another trick you can try is using a BatchScanner instead of a Scanner to
>>> read from multiple nodes in parallel. The tradeoff here is you get better
>>> query latency, but your key/value pairs are likely to come back out of
>>> sorted order. This section of the user manual describes the BatchScanner:
>> Using the batch scanner option will parallelize the filtering of data on
>> the tablet server side.  So it may be faster, but a lot more work is being
>> done.  This may be a good option if you do not use locality groups
>> and do not need to run lots of them concurrently.  Lots of concurrent
>> batch scanners could slow down query performance.  For example, creating
>> 100 batch scanners w/ 20 threads each will attempt to start 2000
>> threads on the tablet servers to filter data.  Locality groups would
>> be better for lots of concurrent scans.  The Accumulo shell uses the
>> batch scanner to implement grep.  A use case that is better suited
>> to concurrent batch scanners is doing lots of small
>> lookups.
>> Keith
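
The thread arithmetic in the batch scanner example above can be written
out as a small sketch.  The class and method names are hypothetical,
and the figure is only the number of threads the clients would ask for,
not a measurement of what the tablet servers actually run.

```java
// Hypothetical back-of-the-envelope helper: how many server-side filtering
// threads a set of concurrent batch scanners would attempt to start.
class BatchScanThreads {

    static long requestedThreads(int concurrentScanners, int threadsPerScanner) {
        // Each batch scanner asks for its own pool of query threads.
        return (long) concurrentScanners * threadsPerScanner;
    }

    public static void main(String[] args) {
        // The numbers from the message: 100 batch scanners w/ 20 threads each.
        System.out.println(requestedThreads(100, 20) + " threads requested");
    }
}
```
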
