hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-32) [hbase] Add row count estimator
Date Mon, 21 Jul 2008 17:57:31 GMT

    [ https://issues.apache.org/jira/browse/HBASE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615333#action_12615333 ]

Andrew Purtell commented on HBASE-32:

One possible option is to count the entries in the MapFile indexes and multiply that count by
whatever hbase.io.index.interval (or the INDEX_INTERVAL HTD attribute) is, then consider all
of the MapFiles for the columns in a table and choose the largest value. Do this for all of
the table's regions. The result would be a reasonable estimate, but the whole process sounds
expensive. Originally I was thinking that the regionservers could do this since they have to
read in the MapFile indexes anyway, and they also know the count of rows in memcache; but if
regionservers limit the number of in-memory MapFile indexes to avoid OOME, as has been
discussed, they won't have all of the information on hand.
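The arithmetic above can be sketched as follows. This is self-contained illustrative Java, not actual HBase API: the per-MapFile index-entry counts are stand-ins for what would be read from the MapFile indexes, and the interval of 32 is just an assumed value for hbase.io.index.interval.

```java
// Sketch: estimate a region's row count from MapFile index sizes.
// Assumes an index interval of 32 (stand-in for hbase.io.index.interval);
// the entry counts would really come from reading the MapFile indexes.
public class RowCountSketch {
    static final int INDEX_INTERVAL = 32; // assumed interval

    // indexEntriesPerFamily[f][m] = index entry count of MapFile m in family f.
    // Sum entries * interval per column family, then choose the largest
    // per-family value as the region's estimate, as described above.
    static long estimateRegionRows(long[][] indexEntriesPerFamily) {
        long best = 0;
        for (long[] familyFiles : indexEntriesPerFamily) {
            long familyTotal = 0;
            for (long entries : familyFiles) {
                familyTotal += entries * INDEX_INTERVAL;
            }
            best = Math.max(best, familyTotal);
        }
        return best;
    }

    public static void main(String[] args) {
        // Two families: one with two MapFiles (100 + 50 index entries),
        // one with a single MapFile (120 entries).
        System.out.println(estimateRegionRows(new long[][] {{100, 50}, {120}}));
        // prints 4800, i.e. max((100+50)*32, 120*32)
    }
}
```

Summing this estimate over all of the table's regions would give the table-level figure; the expense is in touching every region.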

Maybe a map of MapFile to row count estimations could be stored in the FS next to the MapFiles
and updated appropriately during compactions. A client could then iterate over the regions of
a table and ask the regionservers involved for row count estimations; each regionserver would
consult the estimation map and send back the largest count found there for the table plus the
largest memcache count for the table, and finally the client would total all of the results.
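A minimal sketch of that protocol, with everything in-process for illustration. All names here (RegionEstimate, regionRowEstimate, tableRowEstimate) are hypothetical, not HBase API; the estimation map would really live in the FS next to the MapFiles and be refreshed at compaction time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed scheme: per-region, a map of MapFile name to
// row count estimate plus a memcache row count; the client totals the
// per-region answers. Hypothetical names, not actual HBase API.
public class TableCountSketch {
    static class RegionEstimate {
        // Would be persisted next to the MapFiles, updated on compaction.
        final Map<String, Long> mapFileEstimates = new HashMap<>();
        long memcacheRows; // rows currently only in memcache

        // Largest MapFile estimate plus the memcache count, as proposed.
        long regionRowEstimate() {
            long largest = 0;
            for (long v : mapFileEstimates.values()) {
                largest = Math.max(largest, v);
            }
            return largest + memcacheRows;
        }
    }

    // The client-side total over all regions of the table.
    static long tableRowEstimate(List<RegionEstimate> regions) {
        long total = 0;
        for (RegionEstimate r : regions) {
            total += r.regionRowEstimate();
        }
        return total;
    }
}
```

Since the map is maintained at compaction time, answering the client costs a lookup rather than a re-read of the MapFile indexes.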

> [hbase] Add row count estimator
> -------------------------------
>                 Key: HBASE-32
>                 URL: https://issues.apache.org/jira/browse/HBASE-32
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: client
>            Reporter: stack
>            Priority: Minor
>         Attachments: 2291_v01.patch, Keying.java
> Internally we have a little tool that will do a rough estimate of how many rows there
> are in a dataHbase.  It keeps getting larger and larger partitions running scanners until
> it turns up > N occupied rows.  Once it has a number > N, it multiplies by the partition
> size to get an approximate row count.
> This issue is about generalizing this feature so it could sit in the general hbase install.
> It would look something like:
> {code}
> long getApproximateRowCount(final Text startRow, final Text endRow, final long minimumCountPerPartition, final long maximumPartitionSize)
> {code}
> Larger minimumCountPerPartition and maximumPartitionSize values would make the count
> more accurate but would mean the method ran longer.
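The partition-growing estimator the issue describes might look roughly like this. It is a self-contained sketch, not the internal tool: long row keys and an in-memory sorted array stand in for Text keys and a real scanner, and the doubling schedule is an assumption about how the partitions "keep getting larger".

```java
// Sketch of the partition-growing row count estimator described in the
// issue. A sorted long[] of row keys simulates the table; in real HBase
// the rows would come from scanners over [startRow, endRow).
public class ApproxCount {
    // Double the partition until a scan over it turns up more than
    // minimumCountPerPartition rows, then extrapolate by partition count.
    static long getApproximateRowCount(long[] sortedKeys, long startRow, long endRow,
                                       long minimumCountPerPartition,
                                       long maximumPartitionSize) {
        long partition = 1;
        while (partition < maximumPartitionSize) {
            long found = countRows(sortedKeys, startRow, startRow + partition);
            if (found > minimumCountPerPartition) {
                // Number of partitions covering [startRow, endRow), rounded up.
                long partitions = (endRow - startRow + partition - 1) / partition;
                return found * partitions;
            }
            partition *= 2; // keep getting larger partitions
        }
        // Partition cap reached: fall back to counting the whole range.
        return countRows(sortedKeys, startRow, endRow);
    }

    // Simulated scan: count keys in [from, to).
    static long countRows(long[] sortedKeys, long from, long to) {
        long n = 0;
        for (long k : sortedKeys) {
            if (k >= from && k < to) n++;
        }
        return n;
    }
}
```

As the description says, raising minimumCountPerPartition (and allowing larger partitions) samples more rows before extrapolating, trading runtime for accuracy.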

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
