Message-ID: <28264633.1202171108099.JavaMail.jira@brutus>
Date: Mon, 4 Feb 2008 16:25:08 -0800 (PST)
From: "Bryan Duxbury (JIRA)"
To: hbase-dev@hadoop.apache.org
Subject: [jira] Commented: (HBASE-32) [hbase] Add row count estimator

    [ https://issues.apache.org/jira/browse/HBASE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565617#action_12565617 ]

Bryan Duxbury commented on HBASE-32:
------------------------------------

I was thinking that, rather than sampling and resampling repeatedly, maybe what you
could do is look at the region start keys, use the edit distance between each region's start and end keys as a proxy for the size of the region, and then scan the presumed largest and presumed smallest regions. This would give you a lower and upper bound on your table size. If your selections of smallest and largest regions happened to be bad, i.e. the counts were inverted, you can always just flip them.

> [hbase] Add row count estimator
> -------------------------------
>
>                 Key: HBASE-32
>                 URL: https://issues.apache.org/jira/browse/HBASE-32
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: client
>            Reporter: stack
>            Priority: Minor
>         Attachments: 2291_v01.patch, Keying.java
>
>
> Internally we have a little tool that will do a rough estimate of how many rows there are in a database. It keeps getting larger and larger partitions running scanners until it turns up > N occupied rows. Once it has a number > N, it multiplies by the partition size to get an approximate row count.
> This issue is about generalizing this feature so it could sit in the general hbase install. It would look something like:
> {code}
> long getApproximateRowCount(final Text startRow, final Text endRow, final long minimumCountPerPartition, final long maximumPartitionSize)
> {code}
> Larger minimumCountPerPartition and maximumPartitionSize values would make the count more accurate but would mean the method ran longer.
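
To make the bounding idea from the comment above concrete, here is a rough sketch in Java. It is not taken from the attached patch or from any HBase API: the class `RowCountBounds` and the methods `editDistance`, `pickRegion`, and `bounds` are hypothetical names, keys are treated as plain strings, and the two row counts are assumed to come from full scans of the two chosen regions done elsewhere. It also assumes every region has non-empty start and end keys, which is not true of the first and last regions of a real table.

```java
/** Hypothetical helper, not part of HBase: bounds a table's row count from two scanned regions. */
final class RowCountBounds {

    /** Levenshtein edit distance between two keys, used only as a rough proxy for region size. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Index of the region whose start/end key distance is largest (or smallest); ties keep the first. */
    static int pickRegion(String[] startKeys, String[] endKeys, boolean largest) {
        int best = 0;
        int bestDist = editDistance(startKeys[0], endKeys[0]);
        for (int i = 1; i < startKeys.length; i++) {
            int d = editDistance(startKeys[i], endKeys[i]);
            if (largest ? d > bestDist : d < bestDist) {
                best = i;
                bestDist = d;
            }
        }
        return best;
    }

    /**
     * Returns {lower, upper} for the table row count, assuming every region holds between
     * the two scanned counts. If the guesses were inverted, the counts are simply flipped.
     */
    static long[] bounds(long rowsInPresumedSmallest, long rowsInPresumedLargest, int regionCount) {
        long lo = Math.min(rowsInPresumedSmallest, rowsInPresumedLargest);
        long hi = Math.max(rowsInPresumedSmallest, rowsInPresumedLargest);
        return new long[] { lo * regionCount, hi * regionCount };
    }
}
```

The result is only a true bound if the two scanned regions really are the smallest and largest in the table; edit distance is a heuristic, so a region it ranks in the middle could still fall outside the range.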