accumulo-user mailing list archives

From Jamie Stephens
Subject Re: Scanner.estimatedCount()?
Date Fri, 27 Jun 2014 15:15:31 GMT

As you suggested, I don't want to pay the price of a CountingIterator.
Fortunately, I don't care about visibility in this case.  (For a couple of
reasons, one of which is that visibility will be uniformly distributed -- I

I'm thinking about doing this:

In mutation-writing clients, sample.  Possibly truncate keys to fit what I
need.  For sampled mutations, write them to a table with a summing
combiner.  (I'll probably also have historical stats tables
'sample_20140627T10:12' or whatever, so I can see samples evolve.)  Then
implement Range.getCountEstimate() by querying the sample table with
summing.  Sound reasonable?
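
A minimal sketch of the scheme above, using only the Java standard library. It is not Accumulo client code: the in-memory TreeMap is a stand-in for the sample table with a summing combiner, and the class name SampledRangeStats, the methods observe() and estimateCount(), and the parameters sampleRate and prefixLength are all hypothetical names chosen for illustration.

```java
import java.util.Random;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the Accumulo API.
class SampledRangeStats {
    private final double sampleRate;   // fraction of mutations that get recorded
    private final int prefixLength;    // sampled keys are truncated to this length
    private final Random random;
    // Stand-in for the sample table: sorted by key, counts summed on write,
    // which is the behavior a summing combiner would provide.
    private final TreeMap<String, Long> sampleTable = new TreeMap<>();

    SampledRangeStats(double sampleRate, int prefixLength, long seed) {
        if (sampleRate <= 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("sampleRate must be in (0, 1]");
        }
        this.sampleRate = sampleRate;
        this.prefixLength = prefixLength;
        this.random = new Random(seed);
    }

    // Called by the writing client once per mutation sent to the main table.
    void observe(String rowKey) {
        if (random.nextDouble() >= sampleRate) {
            return; // this mutation is not part of the sample
        }
        String truncated = rowKey.length() > prefixLength
                ? rowKey.substring(0, prefixLength)
                : rowKey;
        sampleTable.merge(truncated, 1L, Long::sum);
    }

    // Estimates the number of keys in [startKey, endKey) by summing the
    // sampled counts in that range and scaling by the sampling rate.
    long estimateCount(String startKey, String endKey) {
        long sampled = 0;
        for (long count : sampleTable.subMap(startKey, true, endKey, false).values()) {
            sampled += count;
        }
        return Math.round(sampled / sampleRate);
    }
}
```

Truncating keys keeps the sample table small but blurs range boundaries: an estimate can only be as fine-grained as the truncated prefix, so ranges should be aligned to prefixes when that matters.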


On Fri, Jun 27, 2014 at 10:04 AM, Josh Elser wrote:

> You could do this fairly efficiently by leveraging the CountingIterator to
> get an exact count (taking visibilities into account, as well) for the
> range in question. It isn't going to be as fast as a precomputed answer,
> but you could cache that easily.
> The fact that visibilities will affect the cardinality of a term makes it
> harder for us to provide this within Accumulo. In the situations where
> Accumulo itself cares about cardinality, it's agnostic of the visibilities.
> It would be possible to try to build an index of this information
> internally, but, like Eric said, that's not there today.
> On 6/27/14, 10:40 AM, Eric Newton wrote:
>> Short answer: no.
>> Long answer:
>> You can scan the metadata table for the count/size of the files.
>> You can query tablet servers for the basic stats of every tablet for a
>> given table.  This is used for balancing.
>> But really you should collect the statistics you want during ingest and
>> insert them in another table.
>> -Eric
>> On Fri, Jun 27, 2014 at 9:42 AM, Jamie Stephens wrote:
>>     Is there a way to get a quick estimate of the number of keys in a
>>     given range?
>>     Perhaps more generally, getting an estimate of the amount of work
>>     (and even some sort of confidence based on, say, the age of
>>     something) to iterate over a range.
>>     I'd like to do some query planning, so statistics like these sure
>>     would be nice.
>>     --Jamie
