hbase-issues mailing list archives

From "Royston Sellman (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5123) Provide more aggregate functions for Aggregations Protocol
Date Fri, 06 Jan 2012 14:59:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181366#comment-13181366 ]

Royston Sellman commented on HBASE-5123:

Re: 5123. I have also had some time to think about other aggregation functions. (Please be aware
that I am new to HBase, Coprocessors, and the Aggregation Protocol, and I have little knowledge
of distributed numerical algorithms!) It seems to me the pattern in the Aggregation Protocol (AP)
is to return a SINGLE value from a SINGLE column (CF:CQ) of a table. In future one might wish to
extend AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep this in mind when
designing for the SINGLE value/SINGLE column (SVSC) case.
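
To make the SVSC pattern concrete, here is a minimal sketch of how a single average could be assembled from per-region pieces. It is not the AggregationClient source: the class name PartialAvg and its methods are hypothetical, and it assumes each region can hand back a (sum, row count) pair for its slice of the column, which the client then merges.

```java
import java.util.List;

/** Hypothetical per-region partial result for an average: a sum and a row count. */
final class PartialAvg {
    final long sum;
    final long count;

    PartialAvg(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    /** What each region would compute locally over its slice of the column. */
    static PartialAvg ofRegion(long[] values) {
        long s = 0;
        for (long v : values) {
            s += v;
        }
        return new PartialAvg(s, values.length);
    }

    /** Client-side merge: add up sums and counts, then divide once at the end. */
    static double merge(List<PartialAvg> partials) {
        long sum = 0;
        long count = 0;
        for (PartialAvg p : partials) {
            sum += p.sum;
            count += p.count;
        }
        if (count == 0) {
            return Double.NaN; // no rows matched, so the average is undefined
        }
        return (double) sum / count;
    }
}
```

Averaging the per-region averages directly would weight small and large regions equally, which is why the sketch carries the counts along and divides only once.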

So, common SVSC aggregation functions:
currently supported:
avg (arithmetic mean)

not currently supported:

for column values of all numeric types, returning values of that type. Current support is
only for Long type.

Some thoughts on the future possibilities:
An example of a future SINGLE value MULTIPLE column use case could be weighted versions of
the above functions, i.e. a column of weights is applied to the column of values and the new
aggregation is then derived from the weighted values.
(note: there is a very good description of Weighted Median in the R language documentation:
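
As an illustration of the weighted case, the following is a single-node sketch of a weighted median over one column of values and one column of weights. The class and method names are hypothetical, and the convention used (the smallest value whose cumulative weight reaches half of the total weight, sometimes called the lower weighted median) is an assumption; other definitions interpolate between neighbouring values.

```java
import java.util.Arrays;
import java.util.Comparator;

final class WeightedMedianSketch {
    /**
     * Lower weighted median: the smallest value whose cumulative weight
     * reaches at least half of the total weight.
     * Assumes non-negative weights with a positive total.
     */
    static double weightedMedian(double[] values, double[] weights) {
        if (values.length == 0 || values.length != weights.length) {
            throw new IllegalArgumentException("need equal-length, non-empty arrays");
        }
        Integer[] order = new Integer[values.length];
        double total = 0.0;
        for (int i = 0; i < values.length; i++) {
            if (weights[i] < 0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            order[i] = i;
            total += weights[i];
        }
        if (!(total > 0)) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        // Sort the row indices by value so values and weights stay paired.
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));
        double cumulative = 0.0;
        for (int idx : order) {
            cumulative += weights[idx];
            if (cumulative >= total / 2.0) {
                return values[idx];
            }
        }
        // Only reachable through floating-point rounding of the running sum.
        return values[order[order.length - 1]];
    }
}
```

Unlike avg, an exact median cannot be merged from one number per region, so a coprocessor version would need to return more than a single partial value from each region. This sketch deliberately ignores that problem.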

An example of a future MULTIPLE value SINGLE column case could be range: return all rows with a
column value between two values. Maybe this is a bad example because there could be better HBase
ways to do it with filters/scans at a higher level. Perhaps binning is a better example? i.e.
return an array containing the values derived from applying one of the SVSC functions to each bin
of a binned column, e.g.:
    int bins = 100;
    aClient.sum(table, ci, scan, bins); // => {12.3, 14.5...}
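
A local sketch of what such a binned sum might compute is given below. How the bins are formed is not specified above, so this assumes the simplest reading: the values, in scan order, are split into contiguous chunks of near-equal size and each chunk is summed. BinnedSumSketch and binnedSum are hypothetical names, not part of the Aggregation Protocol.

```java
final class BinnedSumSketch {
    /**
     * Splits the values, in the order given, into `bins` contiguous chunks
     * of near-equal size and returns the sum of each chunk.
     * Bins that receive no values keep a sum of 0.0.
     */
    static double[] binnedSum(double[] values, int bins) {
        if (bins <= 0) {
            throw new IllegalArgumentException("bins must be positive");
        }
        double[] sums = new double[bins];
        for (int i = 0; i < values.length; i++) {
            // Map position i in [0, n) to a bin index in [0, bins).
            int bin = (int) ((long) i * bins / values.length);
            sums[bin] += values[i];
        }
        return sums;
    }
}
```

Binning by value range instead of by position would be an equally valid reading. That variant needs the column minimum and maximum first, which are themselves SVSC aggregates.
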
Another example (common in several programming languages) is to map an arbitrary function
over a column and return the new vector. Of course, again this may be a bad example in the
case of long HBase columns but it seems like an appropriate thing to do with coprocessors.
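
For the map idea, a minimal in-memory sketch follows. It only shows the shape of the operation (one function applied to every value, with a new vector returned). ColumnMapSketch is a hypothetical name, and nothing here touches HBase or addresses how a very long result vector would be shipped back from the regionservers.

```java
import java.util.function.DoubleUnaryOperator;

final class ColumnMapSketch {
    /** Applies f to every value of the column and returns the results as a new array. */
    static double[] map(double[] column, DoubleUnaryOperator f) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = f.applyAsDouble(column[i]);
        }
        return out;
    }
}
```

A call such as ColumnMapSketch.map(column, Math::sqrt) would return the element-wise square roots and leave the input array untouched.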

MULTIPLE value MULTIPLE column examples are common in spatial data processing but I see there
has been a lot of spatial/GIS discussion around HBase which I have not read yet. So I'll keep
quiet for now.

I hope these thoughts strike a balance between my (special interest) use case of statistical/spatial
functions on tables and general purpose (but coprocessor enabled/regionserver distributed)

> Provide more aggregate functions for Aggregations Protocol
> ----------------------------------------------------------
>                 Key: HBASE-5123
>                 URL: https://issues.apache.org/jira/browse/HBASE-5123
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Zhihong Yu
> Royston requested the following aggregates on top of what we already have:
> Median, Weighted Median, Mult
> See discussion entitled 'AggregateProtocol Help' on user list

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

