From: Jamie Stephens
Date: Fri, 27 Jun 2014 10:15:31 -0500
Subject: Re: Scanner.estimatedCount()?
To: user@accumulo.apache.org
In-Reply-To: <53AD87EC.90905@gmail.com>

Josh,

As you suggested, I don't want to pay the price of a CountingIterator.
Fortunately, I don't care about visibility in this case. (For a couple of
reasons, one of which is that visibility will be uniformly distributed -- I
think.)

I'm thinking about doing this:

In mutation-writing clients, sample. Possibly truncate keys to fit what I
need. For sampled mutations, write them to a table with a summing combiner.
(I'll probably also have historical stats tables 'sample_20140627T10:12' or
whatever, so I can see samples evolve.) Then implement
Range.getCountEstimate() by querying the sample table with summing.

Sound reasonable?

--Jamie

On Fri, Jun 27, 2014 at 10:04 AM, Josh Elser wrote:

> You could do this fairly efficiently by leveraging the CountingIterator to
> get an exact count (taking visibilities into account, as well) for the
> range in question. It isn't going to be as fast as a precomputed answer,
> but you could cache that easily.
>
> The fact that visibilities will affect the cardinality of a term makes it
> harder for us to provide this within Accumulo. The situations where
> Accumulo itself cares about cardinality, it's agnostic of the visibilities.
> It would be possible to try to build an index of this information
> internally, but, like Eric said, that's not there today.
>
>
> On 6/27/14, 10:40 AM, Eric Newton wrote:
>
>> Short answer: no.
>>
>> Long answer:
>>
>> You can scan the metadata table for the count/size of the files.
>>
>> You can query tablet servers for the basic stats of every tablet for a
>> given table. This is used for balancing.
>>
>> But really you should collect the statistics you want during ingest and
>> insert them in another table.
>>
>> -Eric
>>
>>
>> On Fri, Jun 27, 2014 at 9:42 AM, Jamie Stephens <js@morphism.com> wrote:
>>
>>     Is there a way to get a quick estimate of the number of keys in a
>>     given range?
>>
>>     Perhaps more generally, getting an estimate of the amount of work
>>     (and even some sort of confidence based on, say, the age of
>>     something) to iterate over a range.
>>
>>     I'd like to do some query planning, so statistics like these sure
>>     would be nice.
>>
>>     --Jamie
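
The scheme proposed in this message (sample on the write path, keep a summed count per truncated key, scale up at query time) can be sketched without any Accumulo dependency. What follows is only an in-memory illustration under stated assumptions, not Accumulo code: a plain dict stands in for the sample table with a summing combiner, and the names `SampledCounter`, `sample_rate`, `prefix_len`, `observe`, and `estimate` are all hypothetical.

```python
import random
from collections import defaultdict


class SampledCounter:
    """In-memory stand-in for a 'sample' table with a summing combiner.

    Assumption: every written row key is passed to observe() exactly once.
    """

    def __init__(self, sample_rate, prefix_len, seed=None):
        if not 0.0 < sample_rate <= 1.0:
            raise ValueError("sample_rate must be in (0, 1]")
        self.sample_rate = sample_rate
        self.prefix_len = prefix_len
        self._rng = random.Random(seed)
        # truncated key -> number of sampled writes (the "summed" value)
        self._sums = defaultdict(int)

    def observe(self, row_key):
        """Record a write with probability sample_rate."""
        if self._rng.random() < self.sample_rate:
            self._sums[row_key[: self.prefix_len]] += 1

    def estimate(self, start, end):
        """Estimate how many keys were written in [start, end).

        Range bounds are truncated the same way as the keys, so the
        answer is only as fine-grained as prefix_len allows.
        """
        lo = start[: self.prefix_len]
        hi = end[: self.prefix_len]
        sampled = sum(n for prefix, n in self._sums.items() if lo <= prefix < hi)
        # Scale the sampled count back up by the sampling rate.
        return sampled / self.sample_rate


if __name__ == "__main__":
    counter = SampledCounter(sample_rate=0.1, prefix_len=4, seed=1)
    for i in range(10000):
        counter.observe("row%04d" % i)
    print(counter.estimate("row2", "row5"))
```

Two trade-offs are visible even in this toy version: a lower `sample_rate` cuts write overhead but widens the error of the estimate, and a shorter `prefix_len` shrinks the stats table but blurs range boundaries. Visibility is ignored here, matching the stated assumption that it does not matter for this use case.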