accumulo-user mailing list archives

From Josh Elser
Subject Re: Scanner.estimatedCount()?
Date Fri, 27 Jun 2014 15:34:21 GMT
Nice, not having to worry about visibilities makes the problem easier.

I'd encourage you to even consider forgoing sampling. You might be able 
to get by via combination/reduction in your client, and then setting a 
SummingCombiner on your cardinality table. It may be enough to get an 
accurate view of the statistics without a noticeable performance hit. 
But, you know your situation better than I do :)
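
A minimal sketch of the client-side combine step, written in Python for brevity and with no Accumulo dependency. `SummingTable` is a hypothetical in-memory stand-in for a table configured with a SummingCombiner (it only adds up the values written under one key), and `CombiningWriter`, `prefix_len` and `flush_every` are illustrative names and choices, not part of any Accumulo API.

```python
from collections import Counter


class SummingTable:
    """Hypothetical in-memory stand-in for a table with a summing combiner."""

    def __init__(self):
        self.cells = Counter()
        self.writes = 0  # number of individual writes received

    def put(self, key, delta):
        # A summing combiner adds every value written under the same key.
        self.cells[key] += delta
        self.writes += 1


class CombiningWriter:
    """Pre-aggregates counts in the client and flushes one delta per key."""

    def __init__(self, table, prefix_len=5, flush_every=1000):
        self.table = table
        self.prefix_len = prefix_len    # assumed key truncation
        self.flush_every = flush_every  # assumed buffer size, in records
        self.pending = Counter()
        self.buffered = 0

    def record(self, row):
        # Count the truncated key locally instead of writing immediately.
        self.pending[row[: self.prefix_len]] += 1
        self.buffered += 1
        if self.buffered >= self.flush_every:
            self.flush()

    def flush(self):
        # One write per distinct key, carrying the locally summed delta.
        for key, delta in self.pending.items():
            self.table.put(key, delta)
        self.pending.clear()
        self.buffered = 0


table = SummingTable()
writer = CombiningWriter(table, prefix_len=5, flush_every=1000)
for i in range(5000):
    writer.record(("userA%04d" if i % 2 else "userB%04d") % i)
writer.flush()
```

The counts stay exact because every record is counted; the saving comes from sending far fewer writes to the cardinality table than there are ingested records.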

Let us know how it goes.

On 6/27/14, 11:15 AM, Jamie Stephens wrote:
> Josh,
> As you suggested, I don't want to pay the price of a CountingIterator.
> Fortunately, I don't care about visibility in this case.  (For a couple
> of reasons, one of which is that visibility will be uniformly
> distributed -- I think.)
> I'm thinking about doing this:
> In mutation-writing clients, sample.  Possibly truncate keys to fit what
> I need.  For sampled mutations, write them to a table with a summing
> combiner.  (I'll probably also have historical stats tables
> 'sample_20140627T10:12' or whatever, so I can see samples evolve.)  Then
> implement Range.getCountEstimate() by querying the sample table with
> summing.  Sound reasonable?
> --Jamie
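
The sampling plan above can be sketched roughly as follows, in Python and without any Accumulo dependency. `SampledCounter` is a hypothetical name, its `Counter` stands in for the sample table with a summing combiner, and the sampling rate, key truncation length and seed are assumptions chosen for illustration. The estimate is the summed sample count divided by the sampling rate.

```python
import random
from collections import Counter


class SampledCounter:
    """Samples written rows and estimates per-range counts from the sample."""

    def __init__(self, rate, prefix_len, seed=None):
        if not 0.0 < rate <= 1.0:
            raise ValueError("rate must be in (0, 1]")
        self.rate = rate
        self.prefix_len = prefix_len      # assumed key truncation
        self.rng = random.Random(seed)
        self.sample_table = Counter()     # stand-in for the summing table

    def observe(self, row):
        # Keep each mutation with probability `rate`.
        if self.rng.random() < self.rate:
            self.sample_table[row[: self.prefix_len]] += 1

    def estimate(self, start, end):
        """Estimated number of rows whose truncated key is in [start, end)."""
        sampled = sum(
            count
            for key, count in self.sample_table.items()
            if start <= key < end
        )
        # Scale the sampled count back up by the sampling rate.
        return sampled / self.rate


counter = SampledCounter(rate=0.01, prefix_len=3, seed=42)
for i in range(100000):
    counter.observe("%06d" % i)

# The true count for this range is 50000; the estimate is only approximate.
approx = counter.estimate("000", "050")
```

Range boundaries can only be resolved to the truncated key length, and the relative error grows as the range (and therefore the number of sampled rows in it) shrinks.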
> On Fri, Jun 27, 2014 at 10:04 AM, Josh Elser wrote:
>     You could do this fairly efficiently by leveraging the
>     CountingIterator to get an exact count (taking visibilities into
>     account, as well) for the range in question. It isn't going to be as
>     fast as a precomputed answer, but you could cache that easily.
>     The fact that visibilities will affect the cardinality of a term
>     makes it harder for us to provide this within Accumulo. The
>     situations where Accumulo itself cares about cardinality, it's
>     agnostic of the visibilities. It would be possible to try to build
>     an index of this information internally, but, like Eric said, that's
>     not there today.
>     On 6/27/14, 10:40 AM, Eric Newton wrote:
>         Short answer: no.
>         Long answer:
>         You can scan the metadata table for the count/size of the files.
>         You can query tablet servers for the basic stats of every tablet for a
>         given table.  This is used for balancing.
>         But really you should collect the statistics you want during ingest and
>         insert them in another table.
>         -Eric
>         On Fri, Jun 27, 2014 at 9:42 AM, Jamie Stephens wrote:
>              Is there a way to get a quick estimate of the number of keys in a
>              given range?
>              Perhaps more generally, getting an estimate of the amount of work
>              (and even some sort of confidence based on, say, the age of
>              something) to iterate over a range.
>              I'd like to do some query planning, so statistics like these sure
>              would be nice.
>              --Jamie
