cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL
Date Tue, 17 Feb 2015 08:07:15 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323798#comment-14323798 ]

Robert Stupp commented on CASSANDRA-4914:
-----------------------------------------

// flame off
You could use Spark instead of Hadoop ;)

You're right - aggregates are computed by the coordinator, which has to pull all rows
and run the computation over them. That's (unfortunately) all we can do for now. If an aggregate
is applied to several partitions or to an even bigger data set, query time grows in proportion
to the number of partitions involved (sounds better than _getting slower_).

I have been thinking about a way to let the other nodes ("owners of other partitions")
take part in the aggregate calculation. But that implies that the other nodes _know_ about the
aggregate - i.e. basically the actual CQL. That means the approach *could* be a two-stage aggregate,
where the first stage runs on the partitions and a second (final) stage runs on the partial
results from the first stage. But the current "storage protocol" does not allow us to do that
- it only allows us to fetch _raw data_. Such an approach might also improve edge cases that require
ALLOW FILTERING, which basically does the same thing (pipes all data to the coordinator and
filters there).
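
To make the two-stage idea concrete, here is a minimal sketch in Python. It is not
Cassandra code and not an existing API: PartialAvg, stage_one and stage_two are hypothetical
names, and AVG is just an example aggregate. It assumes each node can reduce its own rows to
a small partial state, so that only that state (not the raw rows) travels to the coordinator.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PartialAvg:
    """Hypothetical partial state for AVG: a running sum and a row count."""
    total: float = 0.0
    count: int = 0


def stage_one(values: Iterable[float]) -> PartialAvg:
    """First stage: would run on the node owning the partition(s)."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    return PartialAvg(total, count)


def stage_two(partials: Iterable[PartialAvg]) -> Optional[float]:
    """Final stage: would run on the coordinator, over partial states only."""
    total = 0.0
    count = 0
    for partial in partials:
        total += partial.total
        count += partial.count
    # No rows at all: there is no average to report.
    return total / count if count else None


# Each inner list stands in for the salary values held by one node.
rows_per_node = [[10.1, 100.0], [1000.0], []]
partials = [stage_one(rows) for rows in rows_per_node]
average = stage_two(partials)
```

The coordinator receives one (total, count) pair per node instead of every row. That only
works for aggregates whose partial states can be merged, which is part of why the nodes
would have to _know_ about the aggregate.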

Your approach looks interesting (although I'm no statistics guru). However, I'm not sure
what's meant by _first record_ or _smart sampling_, since there's no such thing as ordering by
partition key. Don't get me wrong - I'm interested in that.

> Aggregation functions in CQL
> ----------------------------
>
>                 Key: CASSANDRA-4914
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (wide row of column values of int, double, float etc).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc (for the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
>  
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
