Message-ID: <53AD8EFD.7040104@gmail.com>
Date: Fri, 27 Jun 2014 11:34:21 -0400
From: Josh Elser
To: user@accumulo.apache.org
Subject: Re: Scanner.estimatedCount()?
References: <53AD87EC.90905@gmail.com>

Nice, not having to worry about visibilities makes the problem easier.

I'd encourage you to even consider forgoing sampling. You might be able
to get by via combination/reduction in your client, and then setting a
SummingCombiner on your cardinality table. It may be enough to get an
accurate view of the statistics without a noticeable performance hit.

But, you know your situation better than I do :) Let us know how it goes.

On 6/27/14, 11:15 AM, Jamie Stephens wrote:
> Josh,
>
> As you suggested, I don't want to pay the price of a CountingIterator.
> Fortunately, I don't care about visibility in this case. (For a couple
> of reasons, one of which is that visibility will be uniformly
> distributed -- I think.)
>
> I'm thinking about doing this:
>
> In mutation-writing clients, sample. Possibly truncate keys to fit what
> I need. For sampled mutations, write them to a table with a summing
> combiner. (I'll probably also have historical stats tables
> 'sample_20140627T10:12' or whatever, so I can see samples evolve.)
> Then implement Range.getCountEstimate() by querying the sample table
> with summing. Sound reasonable?
>
> --Jamie
>
> On Fri, Jun 27, 2014 at 10:04 AM, Josh Elser wrote:
>
>     You could do this fairly efficiently by leveraging the
>     CountingIterator to get an exact count (taking visibilities into
>     account, as well) for the range in question. It isn't going to be as
>     fast as a precomputed answer, but you could cache that easily.
>
>     The fact that visibilities will affect the cardinality of a term
>     makes it harder for us to provide this within Accumulo. In the
>     situations where Accumulo itself cares about cardinality, it's
>     agnostic of the visibilities. It would be possible to try to build
>     an index of this information internally, but, like Eric said, that's
>     not there today.
>
>     On 6/27/14, 10:40 AM, Eric Newton wrote:
>
>         Short answer: no.
>
>         Long answer:
>
>         You can scan the metadata table for the count/size of the files.
>
>         You can query tablet servers for the basic stats of every tablet
>         for a given table. This is used for balancing.
>
>         But really you should collect the statistics you want during
>         ingest and insert them in another table.
>
>         -Eric
>
>         On Fri, Jun 27, 2014 at 9:42 AM, Jamie Stephens wrote:
>
>             Is there a way to get a quick estimate of the number of
>             keys in a given range?
>
>             Perhaps more generally, getting an estimate of the amount
>             of work (and even some sort of confidence based on, say,
>             the age of something) to iterate over a range.
>
>             I'd like to do some query planning, so statistics like
>             these sure would be nice.
>
>             --Jamie
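
For anyone reading this thread later, here is a rough sketch of the scheme discussed above: sample (or not) in the writing client, truncate keys, pre-combine counts client-side, push them to a stats table that sums on write, and answer a range estimate by summing and scaling. It is plain Java with an in-memory TreeMap standing in for the stats table, so it does not use the Accumulo API at all. The class name, method names, and parameters (SampledCountSketch, observe, flush, estimateCount, sampleRate, prefixLength) are made up for illustration. In a real deployment the TreeMap's role would presumably be played by a table with a SummingCombiner configured and the flush by a BatchWriter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/**
 * In-memory model of "count during ingest, sum on read".
 * Hypothetical names throughout; this is NOT the Accumulo API.
 */
final class SampledCountSketch {
    // Stand-in for the stats table: sorted key -> summed count,
    // mimicking what a summing combiner would produce.
    private final TreeMap<String, Long> statsTable = new TreeMap<>();
    // Client-side buffer so repeated keys become one write (pre-combination).
    private final Map<String, Long> buffer = new HashMap<>();
    private final double sampleRate;
    private final int prefixLength;
    private final Random random;

    SampledCountSketch(double sampleRate, int prefixLength, long seed) {
        if (sampleRate <= 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("sampleRate must be in (0, 1]");
        }
        this.sampleRate = sampleRate;
        this.prefixLength = prefixLength;
        this.random = new Random(seed);
    }

    /** Call once per mutation written; sampleRate 1.0 means "no sampling". */
    void observe(String rowKey) {
        if (random.nextDouble() >= sampleRate) {
            return; // this mutation was not sampled
        }
        // Truncate keys so the stats table stays small.
        String truncated = rowKey.length() > prefixLength
                ? rowKey.substring(0, prefixLength)
                : rowKey;
        buffer.merge(truncated, 1L, Long::sum);
    }

    /** Push buffered counts to the stats table, summing with existing values. */
    void flush() {
        for (Map.Entry<String, Long> e : buffer.entrySet()) {
            statsTable.merge(e.getKey(), e.getValue(), Long::sum);
        }
        buffer.clear();
    }

    /** Estimate the key count in [start, end): sum sampled counts, scale by 1/sampleRate. */
    long estimateCount(String startInclusive, String endExclusive) {
        long sampled = 0;
        for (long c : statsTable.subMap(startInclusive, true, endExclusive, false).values()) {
            sampled += c;
        }
        return Math.round(sampled / sampleRate);
    }

    public static void main(String[] args) {
        SampledCountSketch stats = new SampledCountSketch(1.0, 3, 42L);
        for (String key : new String[] {"apple1", "apple2", "banana1", "cherry1"}) {
            stats.observe(key);
        }
        stats.flush();
        System.out.println(stats.estimateCount("a", "c"));
    }
}
```

Because keys are truncated, a range boundary that falls inside a prefix can only be answered approximately, and with sampleRate below 1.0 the estimate carries sampling error on top of that. Setting sampleRate to 1.0 corresponds to forgoing sampling and relying on client-side combination alone.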