hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7958) Statistics per-column family per-region
Date Wed, 14 May 2014 06:06:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997314#comment-13997314 ]

Andrew Purtell commented on HBASE-7958:
---------------------------------------

Thanks for the summary [~jesse_yates]. 

The description for this issue is 'statistics per-column family per-region'. In that scope,
maintaining a system table for statistics gathering is unnecessary; we can use region-local
storage. Perhaps during compactions we could calculate the basic things people seem to want:
row count, row key cardinality, min/max/avg size per value, and total value size. Per CF,
per region. Column qualifier cardinality also seems like it might be useful. Perhaps we could
maintain a tree of statistics files: at the HFile level, at the CF level, at the table level,
and at the namespace level. Compactions would record into the resulting HFiles the statistics
metadata calculated during processing. A background process running in the master could aggregate
while walking the tree, swapping updated results for older results at every level when ready.
We should be able to handle point-in-time counts and simple statistical properties in this
way? It could be possible to use a system statistics table instead of files, but why have
regionservers exchange RPCs if not necessary? (Updating a table inline with compaction or
split handling also brings back unfond memories of something we once called the
'region historian'.)
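
To make the aggregation idea concrete, here is a minimal sketch of a mergeable per-CF summary in plain Java. Everything in it is hypothetical: CfStats, addCell, and merge are illustrative names, not existing HBase classes or APIs, and the sketch says nothing about how the numbers would be stored in an HFile. It covers only the additive statistics (counts, min/max/avg value size, total value size). Summed row counts are exact only when the children cover disjoint row ranges (for example, regions of a table); for HFiles within one store a row can appear in several files, so the sum is an upper bound. Row key and qualifier cardinality are not additive and would need a mergeable sketch structure instead, which is left out here.

```java
import java.util.List;

/** Hypothetical mergeable summary for one column family; not an HBase class. */
final class CfStats {
    long rowCount;
    long valueCount;
    long totalValueSize;
    long minValueSize = Long.MAX_VALUE;  // sentinel until the first cell is seen
    long maxValueSize = Long.MIN_VALUE;  // sentinel until the first cell is seen

    /** Record one cell; newRow is true when this cell starts a new row key. */
    void addCell(boolean newRow, long valueSize) {
        if (newRow) {
            rowCount++;
        }
        valueCount++;
        totalValueSize += valueSize;
        minValueSize = Math.min(minValueSize, valueSize);
        maxValueSize = Math.max(maxValueSize, valueSize);
    }

    /** Average value size, or 0.0 when no cells were recorded. */
    double avgValueSize() {
        return valueCount == 0 ? 0.0 : (double) totalValueSize / valueCount;
    }

    /** Combine child summaries (e.g. HFile level) into a parent summary (e.g. CF level). */
    static CfStats merge(List<CfStats> children) {
        CfStats parent = new CfStats();
        for (CfStats child : children) {
            parent.rowCount += child.rowCount;
            parent.valueCount += child.valueCount;
            parent.totalValueSize += child.totalValueSize;
            parent.minValueSize = Math.min(parent.minValueSize, child.minValueSize);
            parent.maxValueSize = Math.max(parent.maxValueSize, child.maxValueSize);
        }
        return parent;
    }
}
```

The same merge step could be applied at each level of the tree (HFile to CF, CF to table, table to namespace), which is what would let a background process build new results and swap them in without touching the lower levels.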

> Statistics per-column family per-region
> ---------------------------------------
>
>                 Key: HBASE-7958
>                 URL: https://issues.apache.org/jira/browse/HBASE-7958
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 0.95.2
>            Reporter: Jesse Yates
>            Assignee: Jesse Yates
>         Attachments: hbase-7958-v0-parent.patch, hbase-7958-v0.patch, hbase-7958_rough-cut-v0.patch
>
>
> Originating from this discussion on the dev list: http://search-hadoop.com/m/coDKU1urovS/Simple+stastics+per+region/v=plain
> Essentially, we should have built-in statistics gathering for HBase tables. This allows
> clients to have a better understanding of the distribution of keys within a table and a given
> region. We could also surface this information via the UI.
> There are a couple of different proposals from the email; the overview is this:
> We add in something on compactions that gathers stats about the keys that are written
> and then we surface them to a table.
> The possible proposals include:
> *How to implement it?*
> # Coprocessors
> ** advantage - it easily plugs in and people could pretty easily add their own statistics.
> ** disadvantage - UI elements would also require this, we get into dependent loading,
> which leads down the OSGi path. Also, these CPs need to be installed _after_ all the other
> CPs on compaction to ensure they see exactly what gets written (doable, but a pain)
> # Built into HBase as a custom scanner
> ** advantage - always goes in the right place and no need to muck about with loading
> CPs etc.
> ** disadvantage - less pluggable, at least for the initial cut
> *Where do we store data?*
> # .META.
> ** advantage - it's an existing table, so we can jam it into another CF there
> ** disadvantage - this would make META much larger, possibly leading to splits AND will
> make it much harder for other processes to read the info
> # A new stats table
> ** advantage - cleanly separates out the information from META
> ** disadvantage - should use a 'system table' idea to prevent accidental deletion, manipulation
> by arbitrary clients, but still allow clients to read it.
> Once we have this framework, we can then move to an actual implementation of various
> statistics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
