accumulo-user mailing list archives

From: Anthony Fox
Subject: Re: Performance of table with large number of column families
Date: Fri, 09 Nov 2012 16:45:50 GMT
Yes, there are 10M possible partitions.  I do not have a hash from value to
partition; the data is essentially randomly balanced across all the
tablets.  Unlike the bloom filter and intersecting iterator examples, I do
not have locality groups turned on, and both index entries and record
entries carry data in the column qualifier (cq) and in the value.  Could
this be the issue?  Each record entry has approximately 30 column
qualifiers with data in the value for each.
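
For illustration, a hash from a record ID to a partition could look like the
sketch below.  It is a minimal example under assumptions that are not part of
the table described above: the class name PartitionHash, the method
partitionFor, and the choice of CRC32 as the hash are all hypothetical.  It
only shows how a deterministic mapping would let a lookup compute the one row
that can hold a given record instead of scanning every partition.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Accumulo: derives a partition row from a record ID.
final class PartitionHash {
    // Matches the 0000000-9999999 row space described in the thread.
    static final int NUM_PARTITIONS = 10_000_000;

    // Returns a zero-padded, 7-digit partition string for the given record ID.
    static String partitionFor(String recordId) {
        CRC32 crc = new CRC32();
        crc.update(recordId.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is always non-negative, so the remainder is a valid bucket.
        long bucket = crc.getValue() % NUM_PARTITIONS;
        return String.format("%07d", bucket);
    }

    public static void main(String[] args) {
        // The same ID always maps to the same row, so a reader can recompute it.
        System.out.println(partitionFor("record-000123"));
    }
}
```

CRC32 is used here only because it ships with the JDK; any reasonably uniform
hash would serve the same purpose.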

On Fri, Nov 9, 2012 at 11:41 AM, William Slacum wrote:

> I guess assuming you have 10M possible partitions, if you're using a
> relatively uniform hash to generate your IDs, you'll average about 2 per
> partition. Do you have any index for term/value to partition? This will
> help you narrow down your search space to a subset of your partitions.
> On Fri, Nov 9, 2012 at 11:39 AM, William Slacum wrote:
>> That shouldn't be a huge issue. How many rows/partitions do you have? How
>> many do you have to scan to find the specific column family/doc id you want?
>> On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox wrote:
>>> I have a table set up to use the intersecting iterator pattern.  The
>>> table has about 20M records which leads to 20M column families for the
>>> data section - 1 unique column family per record.  The index section of
>>> the table is not quite as large as the data section.  The row key is a
>>> random zero-padded integer partition between 0000000 and 9999999.  I turned
>>> bloom filters on and used the ColumnFamilyFunctor to get performant
>>> column family scans without specifying a range like in the bloom filter
>>> examples in the README.  However, my column family scans (without any
>>> custom iterator) are still fairly slow - ~30 seconds for a column family
>>> batch scan of one record. I've also tried RowFunctor but I see similar
>>> performance.  Can anyone shed any light on the performance metrics I'm
>>> seeing?
>>> Thanks,
>>> Anthony
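
The estimate of about 2 records per partition quoted above follows from
20M records spread over 10M partitions.  The sketch below checks that
arithmetic with a scaled-down simulation (200,000 records over 100,000
partitions).  The class and method names are hypothetical, and uniform random
assignment is an assumption based on the description of the row key as a
random partition.

```java
import java.util.Random;

// Hypothetical simulation of uniformly random partition assignment.
final class PartitionOccupancy {
    // Assigns each record to a random partition and returns the per-partition counts.
    static int[] assign(int records, int partitions, long seed) {
        Random rnd = new Random(seed);
        int[] counts = new int[partitions];
        for (int i = 0; i < records; i++) {
            counts[rnd.nextInt(partitions)]++;
        }
        return counts;
    }

    // Mean number of records per partition.
    static double average(int[] counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        return (double) total / counts.length;
    }

    // Number of partitions that received no records at all.
    static int emptyPartitions(int[] counts) {
        int empty = 0;
        for (int c : counts) {
            if (c == 0) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        // Same 2:1 ratio as 20M records over 10M partitions, scaled down by 100.
        int[] counts = assign(200_000, 100_000, 42L);
        System.out.println("average per partition: " + average(counts));
        System.out.println("empty partitions: " + emptyPartitions(counts));
    }
}
```

The average is fixed by the ratio of records to partitions, while the count of
empty partitions shows how unevenly a random assignment fills the row space.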
