hbase-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7958) Statistics per-column family per-region
Date Thu, 28 Feb 2013 02:43:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589117#comment-13589117
] 

Todd Lipcon commented on HBASE-7958:
------------------------------------

Before we get too far into the details, can we clarify what kinds of statistics we're interested
in collecting in the first place? There are a bunch of different things we could collect, so
it may be good to enumerate some of them, along with their potential applications,
before we get into how they're implemented.

Here are a few of the places where I've considered adding "statistics" in the past -- though
they fall into different buckets which not everyone might consider statistics :) :

- *Block "heat"* -- keep a reservoir sample of which rows in memstore have been read recently.
When we flush the file, create a bitmap based on the sample mapping each flushed HFile block
to its "heat". These heat maps could be re-generated periodically based on block cache contents
after the file is flushed. (Something like 2 bits per HFile block would mean that the heat
map for even a very large region could be re-written to disk in only a few MB.) *Use case*:
when we move a region to another server, it can more effectively pre-warm its
cache.
- *Row key distribution* -- this seems to be the thing that people are talking about here
mostly. Useful for calculating better split points for MR or region splits.
- *Row key cardinality* -- useful for join ordering in SQL engines with optimizers
- *Column qualifier and cell value cardinality* -- useful for join ordering as well as potentially
automatic dictionary-coding?
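
To make the "2 bits per HFile block" estimate concrete, here is a rough Python sketch of how such a heat map could be packed and how large it would be. Everything in it is an assumption for illustration, not existing HBase code: the names `pack_heat`, `read_heat` and `heat_map_bytes` are made up, heat is taken to be a level from 0 (cold) to 3 (hot), and the block size is taken to be 64 KiB.

```python
BITS_PER_BLOCK = 2                    # from the "2 bits per HFile block" estimate
LEVELS = 1 << BITS_PER_BLOCK          # 4 heat levels: 0 (cold) .. 3 (hot)
BLOCKS_PER_BYTE = 8 // BITS_PER_BLOCK


def pack_heat(levels):
    """Pack per-block heat levels (0..3) into bytes, four blocks per byte."""
    out = bytearray((len(levels) + BLOCKS_PER_BYTE - 1) // BLOCKS_PER_BYTE)
    for i, level in enumerate(levels):
        if not 0 <= level < LEVELS:
            raise ValueError(f"heat level out of range: {level}")
        byte_index, slot = divmod(i, BLOCKS_PER_BYTE)
        out[byte_index] |= level << (slot * BITS_PER_BLOCK)
    return bytes(out)


def read_heat(packed, block_index):
    """Return the heat level stored for one block."""
    byte_index, slot = divmod(block_index, BLOCKS_PER_BYTE)
    return (packed[byte_index] >> (slot * BITS_PER_BLOCK)) & (LEVELS - 1)


def heat_map_bytes(region_bytes, block_bytes=64 * 1024):
    """Size of the packed heat map for a region (block size is an assumption)."""
    blocks = (region_bytes + block_bytes - 1) // block_bytes
    return (blocks * BITS_PER_BLOCK + 7) // 8


if __name__ == "__main__":
    packed = pack_heat([0, 3, 1, 2, 3])
    print([read_heat(packed, i) for i in range(5)])
    print(heat_map_bytes(10 * 2**30), "bytes for a 10 GiB region")
```

Under those assumptions, 10 GiB of 64 KiB blocks needs a 40 KiB map, and the map only reaches the MB range once the data it covers approaches a terabyte.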

There are bunches of others that could be brainstormed up... so my main point is: what do
we mean by stats? How should we build this so that it's extensible and usable for future stats
as well as whatever first one we want to implement?
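
One possible shape for the extensibility question, sketched in Python purely for discussion: a small collector interface that sees every row key written (for example during a flush or compaction) and reports a result at the end. None of these names exist in HBase; `StatCollector`, `KeyDistribution`, `KeyCardinality` and `collect` are hypothetical, the use of a reservoir sample for split points is my assumption, and the exact-set cardinality count stands in for whatever sketch a real implementation would use.

```python
import random
from abc import ABC, abstractmethod


class StatCollector(ABC):
    """Hypothetical plug-in point: observe each row key, then report a result."""

    name = "unnamed"

    @abstractmethod
    def observe(self, row_key: bytes) -> None:
        ...

    @abstractmethod
    def result(self):
        ...


class KeyDistribution(StatCollector):
    """Reservoir-samples row keys and reports evenly spaced split points."""

    name = "row_key_distribution"

    def __init__(self, sample_size=1000, parts=4, seed=None):
        self.sample_size = sample_size
        self.parts = parts
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def observe(self, row_key):
        # Algorithm R: every key seen so far has an equal chance of being kept.
        self.seen += 1
        if len(self.sample) < self.sample_size:
            self.sample.append(row_key)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.sample_size:
                self.sample[j] = row_key

    def result(self):
        keys = sorted(self.sample)
        if not keys:
            return []
        return [keys[len(keys) * i // self.parts] for i in range(1, self.parts)]


class KeyCardinality(StatCollector):
    """Exact distinct count; memory grows with the number of distinct keys."""

    name = "row_key_cardinality"

    def __init__(self):
        self.distinct = set()

    def observe(self, row_key):
        self.distinct.add(row_key)

    def result(self):
        return len(self.distinct)


def collect(row_keys, collectors):
    """Feed one pass over the keys to every collector and gather the results."""
    for key in row_keys:
        for collector in collectors:
            collector.observe(key)
    return {collector.name: collector.result() for collector in collectors}


if __name__ == "__main__":
    keys = [b"row%03d" % i for i in range(100)]
    print(collect(keys, [KeyDistribution(), KeyCardinality()]))
```

The point of the sketch is only that a new statistic becomes one more collector class, so the first implementation does not have to decide up front which stats matter.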
                
> Statistics per-column family per-region
> ---------------------------------------
>
>                 Key: HBASE-7958
>                 URL: https://issues.apache.org/jira/browse/HBASE-7958
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 0.96.0
>            Reporter: Jesse Yates
>             Fix For: 0.96.0
>
>
> Originating from this discussion on the dev list: http://search-hadoop.com/m/coDKU1urovS/Simple+stastics+per+region/v=plain
> Essentially, we should have built-in statistics gathering for HBase tables. This allows
> clients to have a better understanding of the distribution of keys within a table and a given
> region. We could also surface this information via the UI.
> There are a couple different proposals from the email, the overview is this:
> We add in something on compactions that gathers stats about the keys that are written
> and then we surface them to a table.
> The possible proposals include:
> *How to implement it?*
> # Coprocessors
> ** advantage - it easily plugs in and people could pretty easily add their own statistics.
> ** disadvantage - UI elements would also require this, we get into dependent loading,
> which leads down the OSGi path. Also, these CPs need to be installed _after_ all the other
> CPs on compaction to ensure they see exactly what gets written (doable, but a pain)
> # Built into HBase as a custom scanner
> ** advantage - always goes in the right place and no need to muck about with loading
> CPs etc.
> ** disadvantage - less pluggable, at least for the initial cut
> *Where do we store data?*
> # .META.
> ** advantage - it's an existing table, so we can jam it into another CF there
> ** disadvantage - this would make META much larger, possibly leading to splits AND will
> make it much harder for other processes to read the info
> # A new stats table
> ** advantage - cleanly separates out the information from META
> ** disadvantage - should use a 'system table' idea to prevent accidental deletion, manipulation
> by arbitrary clients, but still allow clients to read it.
> Once we have this framework, we can then move to an actual implementation of various
> statistics.

