incubator-accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: Scanning for rows using columnfamily only
Date Wed, 02 Nov 2011 21:00:41 GMT
On Wed, Nov 2, 2011 at 4:28 PM, Keith Massey
<keith.massey@digitalreasoning.com> wrote:
> On 11/1/11 3:11 PM, Keith Massey wrote:
>>
>> Thanks for the tips. We tried using one locality group per column family
>> (I think there are 20-25). It has definitely sped up queries for all
>> data in a single column family. The first batch comes back in about 5
>> seconds rather than 120 seconds without the locality groups. Our data
>> load time doubled though from 7 hours to 14 hours. I don't have any
>> evidence at this point that it is related to the locality groups. But
>> there were very few differences between the 7-hour load and the 14-hour
>> load. Any thoughts about whether this could be a side effect of loading
>> data into 25 locality groups? Or am I looking in the wrong place?
>> Thanks again.
>>
>> Keith
>
> Actually I might have spoken too soon. While many queries now come back in
> around 5 seconds that previously took more than 100, some still take a
> really long time. Specifically they seem to be queries for two column
> families that only appear in about 50 rows total (across billions in the
> table). I've lumped these two metadata-type column families into a single
> locality group. I've confirmed that they are recognized as being in a
> locality group. But if I "scan -c
> <column_family_that_is_in_this_locality_group>" in cloudbase shell, it takes
> hundreds of seconds to return all < 50 rows. Was this a bad use of locality
> groups? Should we just put this metadata into its own table? Thanks again.
>
> Keith
>

If you are scanning the entire table, the scanner still needs to visit
every tablet, even when only about 50 rows match.  On each tablet it may
open files, read the file metadata, and determine that nothing is there.
The regular scanner goes through the tablets sequentially, so those
per-tablet costs add up.  The batch scanner would parallelize this.
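
To make the sequential-versus-parallel point concrete, here is a rough,
self-contained Java sketch.  It is a toy model, not Accumulo code: the
Tablet class, the 5 ms visit cost, and the method names are all
assumptions made up for illustration.  In the real client API the
parallel path is the BatchScanner obtained from
Connector.createBatchScanner, which takes a number of query threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: none of these names come from the Accumulo API.
class TabletScanModel {

    // Hypothetical stand-in for one tablet. Every visit costs visitMillis
    // (opening files, reading file metadata) even if no rows match.
    static final class Tablet {
        final List<String> matchingRows;
        final long visitMillis;

        Tablet(List<String> matchingRows, long visitMillis) {
            this.matchingRows = matchingRows;
            this.visitMillis = visitMillis;
        }

        List<String> scan() throws InterruptedException {
            Thread.sleep(visitMillis);
            return matchingRows;
        }
    }

    // Regular-scanner style: one tablet after another, so the total time is
    // roughly (number of tablets) * (cost per visit).
    static List<String> scanSequentially(List<Tablet> tablets) throws InterruptedException {
        List<String> rows = new ArrayList<>();
        for (Tablet tablet : tablets) {
            rows.addAll(tablet.scan());
        }
        return rows;
    }

    // Batch-scanner style: the same visits spread over a fixed thread pool.
    // Results are gathered in tablet order here for simplicity.
    static List<String> scanInParallel(List<Tablet> tablets, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (Tablet tablet : tablets) {
                Callable<List<String>> visit = tablet::scan;
                pending.add(pool.submit(visit));
            }
            List<String> rows = new ArrayList<>();
            for (Future<List<String>> result : pending) {
                rows.addAll(result.get());
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // 200 tablets, only one of which holds a matching row.
        List<Tablet> tablets = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            List<String> rows = (i == 42)
                    ? Arrays.asList("meta_row")
                    : Collections.<String>emptyList();
            tablets.add(new Tablet(rows, 5));
        }

        long start = System.nanoTime();
        scanSequentially(tablets);
        long sequentialMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        scanInParallel(tablets, 10);
        long parallelMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("sequential: " + sequentialMs + " ms, parallel: " + parallelMs + " ms");
    }
}
```

Both methods return the same rows; only the wall-clock time differs,
because the empty tablets are visited concurrently instead of one by
one.  Unlike this model, a real BatchScanner does not return results in
sorted order.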

Enabling the index cache for the table and adjusting the index cache
size may help the file metadata operations on each tablet go faster.
In 1.4 we enabled the index cache for all tables by default.

How many tablets do you have?

Keith
