Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-dev@hadoop.apache.org
Message-ID: <28264633.1202171108099.JavaMail.jira@brutus>
Date: Mon, 4 Feb 2008 16:25:08 -0800 (PST)
From: "Bryan Duxbury (JIRA)" <jira@apache.org>
To: hbase-dev@hadoop.apache.org
Subject: [jira] Commented: (HBASE-32) [hbase] Add row count estimator
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565617#action_12565617 ] 

Bryan Duxbury commented on HBASE-32:
------------------------------------

I was thinking that, rather than sampling and resampling repeatedly, you could look at the region start keys, use the edit distance between each region's start and end keys as a proxy for the size of that region, and then scan the presumed largest and presumed smallest regions. This would give you a lower and upper bound on your table size. If your selections of smallest and largest regions happened to be bad, i.e. the counts came out inverted, you can always just flip them.
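
A rough sketch of that idea follows, under several assumptions that are not part of the comment: the class, the Region type and the scanCount callback are hypothetical stand-ins rather than HBase API; the "distance" between keys is approximated by reading the first 8 bytes of each key as a base-256 fraction instead of computing a true edit distance; and each scanned count is multiplied by the number of regions to get a table-level figure. The resulting bounds are heuristic, since they only hold if the two chosen regions really are the extremes.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.ToLongFunction;

public class RegionBoundsEstimator {
    /** Hypothetical stand-in for a region covering [startKey, endKey). */
    public static final class Region {
        public final String startKey;
        public final String endKey;
        public Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Maps the first 8 bytes of a key to a fraction in [0, 1). */
    static double keyToFraction(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        double fraction = 0.0;
        double scale = 1.0 / 256;
        for (int i = 0; i < Math.min(bytes.length, 8); i++) {
            fraction += (bytes[i] & 0xFF) * scale;
            scale /= 256;
        }
        return fraction;
    }

    /** Key-space width of a region; an empty end key means "end of table". */
    static double span(Region r) {
        double start = keyToFraction(r.startKey);
        double end = r.endKey.isEmpty() ? 1.0 : keyToFraction(r.endKey);
        return end - start;
    }

    /**
     * Scans only the presumed smallest and presumed largest regions and
     * returns {lowerBound, upperBound}, flipping them if they come out inverted.
     * scanCount is a hypothetical callback that counts the rows in one region.
     */
    public static long[] estimateBounds(List<Region> regions,
                                        ToLongFunction<Region> scanCount) {
        Region smallest = regions.get(0);
        Region largest = regions.get(0);
        for (Region r : regions) {
            if (span(r) < span(smallest)) smallest = r;
            if (span(r) > span(largest)) largest = r;
        }
        long low = scanCount.applyAsLong(smallest) * regions.size();
        long high = scanCount.applyAsLong(largest) * regions.size();
        return low <= high ? new long[] {low, high} : new long[] {high, low};
    }
}
```

Only two regions are ever scanned, so the cost stays fixed as the table grows; the price is that skewed key distributions can make the key-span proxy pick the wrong regions.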

> [hbase] Add row count estimator
> -------------------------------
>
>                 Key: HBASE-32
>                 URL: https://issues.apache.org/jira/browse/HBASE-32
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: client
>            Reporter: stack
>            Priority: Minor
>         Attachments: 2291_v01.patch, Keying.java
>
>
> Internally we have a little tool that will do a rough estimate of how many rows there are in a database.  It keeps running scanners over larger and larger partitions until it turns up > N occupied rows.  Once it has a number > N, it multiplies by the partition size to get an approximate row count.
> This issue is about generalizing this feature so it could sit in the general hbase install.  It would look something like:
> {code}
> long getApproximateRowCount(final Text startRow, final Text endRow, final long minimumCountPerPartition, final long maximumPartitionSize)
> {code}
> Larger minimumCountPerPartition and maximumPartitionSize values would make the count more accurate but would mean the method ran longer.
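
One possible reading of the growing-partition estimator described in the issue can be sketched as below. This is not the internal tool or the attached patch. It makes simplifying assumptions: row keys are modelled as long values rather than Text, the RangeCounter interface is a hypothetical stand-in for running a scanner over a key range, the partition always starts at startRow and doubles in size, and the final count is extrapolated by assuming rows are spread uniformly over the key range.

```java
public class ApproximateRowCounter {
    /** Hypothetical scanner: number of occupied rows in [from, to). */
    public interface RangeCounter {
        long count(long from, long to);
    }

    public static long getApproximateRowCount(long startRow, long endRow,
                                              long minimumCountPerPartition,
                                              long maximumPartitionSize,
                                              RangeCounter scanner) {
        if (maximumPartitionSize < 1) {
            throw new IllegalArgumentException("maximumPartitionSize must be >= 1");
        }
        long total = endRow - startRow;
        if (total <= 0) {
            return 0;
        }
        long size = 1;
        while (true) {
            // Never scan past the requested range or the configured cap.
            size = Math.min(size, Math.min(maximumPartitionSize, total));
            long found = scanner.count(startRow, startRow + size);
            boolean capped = size >= maximumPartitionSize || size >= total;
            if (found > minimumCountPerPartition || capped) {
                // Extrapolate the sampled density to the whole key range.
                return Math.round((double) found * total / size);
            }
            // Grow the partition, guarding against long overflow.
            size = size > Long.MAX_VALUE / 2 ? Long.MAX_VALUE : size * 2;
        }
    }
}
```

In this sketch, raising minimumCountPerPartition or maximumPartitionSize makes the sampled partition larger, which is where the extra accuracy and the extra running time both come from.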

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.