accumulo-user mailing list archives

From Jamie Stephens ...@morphism.com>
Subject Re: Scanner.estimatedCount()?
Date Fri, 27 Jun 2014 15:15:31 GMT
Josh,

As you suggested, I don't want to pay the price of a CountingIterator.
Fortunately, I don't care about visibility in this case.  (For a couple of
reasons, one of which is that visibility will be uniformly distributed -- I
think.)

I'm thinking about doing this:

In the mutation-writing clients, sample the mutations, possibly truncating
keys to the granularity I need.  Write each sampled mutation to a table
with a summing combiner.  (I'll probably also keep historical stats tables
such as 'sample_20140627T10:12', so I can see how the samples evolve.)
Then implement Range.getCountEstimate() by querying the sample table and
summing the counts.  Sound reasonable?
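
A minimal sketch of that idea, in plain Python rather than against the Accumulo API. Everything here is a hypothetical stand-in: SampleTable mimics a table whose cells are merged by a summing combiner, SamplingWriter is the client-side sampler, and estimate_count plays the role of the proposed Range.getCountEstimate(). The sampling rate and prefix length are assumed parameters, and the estimate is only as fine-grained as the truncated prefixes.

```python
import bisect
import random
from collections import defaultdict


class SampleTable:
    """In-memory stand-in for a sample table with a summing combiner."""

    def __init__(self):
        self._counts = defaultdict(int)

    def put(self, key, value):
        # A summing combiner adds values written to the same key.
        self._counts[key] += value

    def range_sum(self, start, end):
        """Sum the stored counts for keys in [start, end)."""
        keys = sorted(self._counts)
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        return sum(self._counts[k] for k in keys[lo:hi])


class SamplingWriter:
    """Client-side sampler: records roughly `rate` of the keys it sees."""

    def __init__(self, table, rate, prefix_len, rng=None):
        if not 0.0 < rate <= 1.0:
            raise ValueError("rate must be in (0, 1]")
        self.table = table
        self.rate = rate
        self.prefix_len = prefix_len
        self.rng = rng or random.Random()

    def observe(self, key):
        # Call once per mutation written to the main table.
        if self.rng.random() < self.rate:
            # Truncate the key so the sample table stays small.
            self.table.put(key[: self.prefix_len], 1)


def estimate_count(table, rate, prefix_len, start, end):
    """Estimate the number of keys in [start, end).

    The bounds are truncated the same way as the sampled keys, so the
    result is only meaningful for ranges aligned to prefix boundaries.
    """
    sampled = table.range_sum(start[:prefix_len], end[:prefix_len])
    return sampled / rate
```

A writer would call observe() for every mutation it sends, and the query planner would call estimate_count() with the same rate and prefix length. The estimate ignores visibilities and deletes, and its error shrinks as the range covers more sampled keys.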

--Jamie



On Fri, Jun 27, 2014 at 10:04 AM, Josh Elser <josh.elser@gmail.com> wrote:

> You could do this fairly efficiently by leveraging the CountingIterator to
> get an exact count (taking visibilities into account, as well) for the
> range in question. It isn't going to be as fast as a precomputed answer,
> but you could cache that easily.
>
> The fact that visibilities will affect the cardinality of a term makes it
> harder for us to provide this within Accumulo. In the situations where
> Accumulo itself cares about cardinality, it's agnostic of the visibilities.
> It would be possible to try to build an index of this information
> internally, but, like Eric said, that's not there today.
>
>
> On 6/27/14, 10:40 AM, Eric Newton wrote:
>
>> Short answer: no.
>>
>> Long answer:
>>
>> You can scan the metadata table for the count/size of the files.
>>
>> You can query tablet servers for the basic stats of every tablet for a
>> given table.  This is used for balancing.
>>
>> But really you should collect the statistics you want during ingest and
>> insert them in another table.
>>
>> -Eric
>>
>>
>> On Fri, Jun 27, 2014 at 9:42 AM, Jamie Stephens <js@morphism.com> wrote:
>>
>>     Is there a way to get a quick estimate of the number of keys in a
>>     given range?
>>
>>     Perhaps more generally, getting an estimate of the amount of work
>>     (and even some sort of confidence based on, say, the age of
>>     something) to iterate over a range.
>>
>>     I'd like to do some query planning, so statistics like these sure
>>     would be nice.
>>
>>     --Jamie
>>
>>
>>
