cassandra-commits mailing list archives

From "Cristian O (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL
Date Tue, 17 Feb 2015 12:50:13 GMT


Cristian O commented on CASSANDRA-4914:

A couple of thoughts:

- Doing aggregations on the coordinator is clearly not feasible in the real world beyond some
toy use cases. I don't know the internals, but it should be doable to push the aggregation
function to the partitions without requiring the data interface to understand CQL. Note that
*all* agg functions are eminently parallelizable, including AVG, which can obviously be computed
from SUM/COUNT on the same elements. As someone pointed out before, these are all REDUCE-type
functions (or monoids, if you like).

- Dealing with consistency is tricky, but then Cassandra is eventually consistent by design,
so why not have eventually consistent aggregations? Just pick a partition and aggregate on
that. With large datasets, an average differing at the sixth decimal won't really matter. Or,
if you want to be really fancy, compute on every partition (or a quorum of them) and return
results with a tolerance factor.
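
To make the two points above concrete, here is a minimal Python sketch (not Cassandra code). It assumes each partition can return a small partial result, here a (count, total) pair, which the coordinator then merges. The names Partial, aggregate_partition, combine and avg_with_tolerance are hypothetical, and reporting the tolerance as half the spread between per-replica averages is only one possible reading of "tolerance factor".

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Partial:
    """Partial aggregate for one partition (hypothetical type)."""
    count: int = 0
    total: float = 0.0

    def merge(self, other: "Partial") -> "Partial":
        # Associative, with Partial() as the identity element (a monoid),
        # so partials can be combined in any grouping.
        return Partial(self.count + other.count, self.total + other.total)

    def avg(self) -> float:
        if self.count == 0:
            raise ValueError("AVG of zero rows is undefined")
        return self.total / self.count


def aggregate_partition(values: Iterable[float]) -> Partial:
    """Would run next to the data; only (count, total) leaves the node."""
    vals = list(values)
    return Partial(len(vals), sum(vals))


def combine(partials: Iterable[Partial]) -> Partial:
    """Would run on the coordinator; work scales with partitions, not rows."""
    return reduce(Partial.merge, partials, Partial())


def avg_with_tolerance(replica_partials: List[Partial]) -> Tuple[float, float]:
    """Assumption: each element is the same data as seen by a different replica.
    Returns (midpoint, half-spread) of the per-replica averages."""
    avgs = [p.avg() for p in replica_partials]
    lo, hi = min(avgs), max(avgs)
    return (lo + hi) / 2, (hi - lo) / 2


if __name__ == "__main__":
    # The three salary values from the issue description, split across
    # two made-up partitions.
    merged = combine([aggregate_partition([10.1, 100.0]),
                      aggregate_partition([1000.0])])
    print(merged.count, merged.total, merged.avg())
```

SUM, COUNT, MIN and MAX fit this shape directly; AVG needs the pair because averages of averages are not correct when partitions differ in size.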

Maybe it's useful to target this feature at use cases that need fast, simple aggregates on
large amounts of data, for example charts on time series.

For more complex analytics, Spark on top of Cassandra is actually an excellent solution already,
if it's set up correctly in terms of colocation. This feature would help in use cases where
Spark is too much overhead.

> Aggregation functions in CQL
> ----------------------------
>                 Key: CASSANDRA-4914
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
> The requirement is to do aggregation of data in Cassandra (wide rows of column values of int, double, float, etc.).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc. (for the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;                        
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);                            
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130

