hadoop-common-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2291) [hbase] Add row count estimator
Date Thu, 20 Dec 2007 18:44:43 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12553744 ]

stack commented on HADOOP-2291:

What is the state of this issue, Edward?  Will it not work on billions of rows?

Other comments on the patch are:

+ Should we add an HTable.getTableDescriptor and an HTable.getColumnFamilies?
+ Comments would be helpful.  For example, it would be good to explain why you suddenly set
a variable 'i' equal to 2, and a comment confirming what you are doing when you find midkeys
over and over again would help.  (Won't this take a long time on a big table?)
+ If your search for endKey turns up a null, won't you get an NPE when you convert back
from base64?
+ I would suggest that the estimator, or an estimator override, take as inputs the smallest
slice to start with and the largest slice that the estimator should try.
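
To make the last suggestion concrete, here is a rough sketch of an estimator that takes the
smallest and largest slice as inputs.  It is not the code in the patch and not an HBase API:
the class name, the method name, and all parameter names are made up for illustration, row
keys are modelled as longs in an in-memory sorted set instead of a table behind a scanner,
and doubling the slice on each pass is my assumption (the issue only says the partitions get
"larger and larger").

```java
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical sketch only; a sorted set of longs stands in for a table scan. */
public class RowCountEstimatorSketch {

    /**
     * Counts occupied rows in a growing slice of [startKey, endKey) and scales
     * the count up to the whole range once more than minimumCount rows are seen
     * or the slice has reached largestSlice.
     */
    static long estimateRowCount(NavigableSet<Long> rowKeys, long startKey, long endKey,
                                 long minimumCount, long smallestSlice, long largestSlice) {
        if (endKey <= startKey || smallestSlice <= 0 || largestSlice < smallestSlice) {
            throw new IllegalArgumentException("invalid key range or slice bounds");
        }
        long range = endKey - startKey;            // assumes the difference fits in a long
        long cap = Math.min(largestSlice, range);  // never scan past endKey
        long slice = Math.min(smallestSlice, cap);
        while (true) {
            // Stand-in for running a scanner over [startKey, startKey + slice).
            long occupied = rowKeys.subSet(startKey, true, startKey + slice, false).size();
            if (occupied > minimumCount || slice >= cap) {
                // Scale the sampled count by how many slices fit in the full range.
                return Math.round(occupied * ((double) range / slice));
            }
            // Double the slice, guarding against overflow and against passing the cap.
            slice = (slice > cap / 2) ? cap : slice * 2;
        }
    }

    public static void main(String[] args) {
        NavigableSet<Long> keys = new TreeSet<>();
        for (long k = 0; k < 10_000; k += 10) {
            keys.add(k);                           // 1000 evenly spaced rows
        }
        System.out.println(estimateRowCount(keys, 0L, 10_000L, 50L, 100L, 10_000L));
    }
}
```

With evenly spaced keys the scaled count matches the true count; with skewed keys the first
slices may be unrepresentative, which is one reason to let the caller choose both bounds.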

> [hbase] Add row count estimator
> -------------------------------
>                 Key: HADOOP-2291
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2291
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: stack
>            Assignee: Edward Yoon
>            Priority: Minor
>         Attachments: 2291_v01.patch, Keying.java
> Internally we have a little tool that will do a rough estimate of how many rows there
> are in a database.  It keeps running scanners over larger and larger partitions until
> it turns up > N occupied rows.  Once it has a number > N, it multiplies by the partition
> size to get an approximate row count.
> This issue is about generalizing this feature so it could sit in the general hbase install.
> It would look something like:
> {code}
> long getApproximateRowCount(final Text startRow, final Text endRow,
>     final long minimumCountPerPartition, final long maximumPartitionSize)
> {code}
> Larger minimumCountPerPartition and maximumPartitionSize values would make the count
> more accurate but would mean the method ran longer.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
