cassandra-commits mailing list archives

From "Anton Slutsky (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL
Date Tue, 17 Feb 2015 05:28:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323669#comment-14323669 ]

Anton Slutsky commented on CASSANDRA-4914:
------------------------------------------

Hello all,

I noticed that some of the aggregate functions discussed on this thread have made it into trunk.
I'm a little concerned about the implementation. It looks like aggregates such as sum, avg,
etc. are implemented by looping through the result set pages and computing
the desired aggregates in code. Since Cassandra is meant for large volumes
of data, I'm worried that this is not a feasible implementation for real-world cases. I tried using
avg on a fairly sizable dataset and observed two things: first, my select statement
timed out even with an increased read timeout setting, and second, the CPU running the
average computation was quite busy.
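
To make the concern concrete, here is a minimal Python sketch of aggregation done by walking result pages, as described above. It is an illustration only, not Cassandra's actual code: `pages` and `paged_avg` are hypothetical names, and an in-memory list stands in for a paged query result. The point is that every row is visited, so the cost grows linearly with the size of the result set.

```python
from typing import Iterable, Iterator, List


def pages(rows: List[float], page_size: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for a paged query: yield rows in fixed-size pages."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


def paged_avg(page_iter: Iterable[List[float]]) -> float:
    """Compute an exact average by visiting every row of every page."""
    total = 0.0
    count = 0
    for page in page_iter:
        for value in page:
            total += value
            count += 1
    if count == 0:
        raise ValueError("avg of an empty result set is undefined")
    return total / count


if __name__ == "__main__":
    # Salary values taken from the example in the issue description.
    print(paged_avg(pages([10.1, 100.0, 1000.0], page_size=2)))
```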

Obviously, there's only so much that can be done to compute these aggregates without
resorting to some sort of distributed computation framework, but I'd like to suggest a slightly
different approach. I wonder if we can rethink how we approach aggregate functions
in the context of large data. Perhaps we could consider probabilistic aggregates
instead of exactly computed ones? That is, instead of striving to compute an aggregate over an
entire result set, maybe we can compute an estimate of the aggregate with a stated probability of that
estimate being correct.

For example:

select probabilistic_avg(my_col) from my_table;

would return something like a map:

{"avg":101.1, "prob":0.78}

where "avg" is our probabilistic avg and "prob" is the probability of it being what we say
it is.
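
One possible reading of this, sketched in Python under stated assumptions: take a uniform random sample, report the sample mean as "avg", and report as "prob" the approximate probability that the true mean lies within a chosen tolerance of that estimate, using a normal approximation. The function name mirrors the CQL example, but the `sample_size` and `tolerance` parameters, the sampling strategy, and the meaning given to "prob" are all assumptions for illustration, not part of the proposal. The sketch also ignores the finite population correction.

```python
import math
import random
import statistics
from typing import Dict, Optional, Sequence


def probabilistic_avg(values: Sequence[float], sample_size: int,
                      tolerance: float, seed: Optional[int] = None) -> Dict[str, float]:
    """Estimate the mean of `values` from a uniform random sample.

    "prob" is the approximate probability that the true mean lies within
    +/- tolerance of "avg" (normal approximation; an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = min(sample_size, len(values))
    sample = rng.sample(values, n)
    avg = statistics.fmean(sample)
    if n == len(values):
        return {"avg": avg, "prob": 1.0}  # every value was read: exact result
    if n < 2:
        return {"avg": avg, "prob": 0.0}  # spread cannot be estimated
    std_err = statistics.stdev(sample) / math.sqrt(n)
    if std_err == 0.0:
        return {"avg": avg, "prob": 1.0}  # degenerate: no spread observed in the sample
    # P(|estimate - true mean| < tolerance) for a normal with sd = std_err
    prob = math.erf(tolerance / (std_err * math.sqrt(2)))
    return {"avg": avg, "prob": prob}


if __name__ == "__main__":
    data = [float(i) for i in range(1_000_000)]
    print(probabilistic_avg(data, sample_size=1_000, tolerance=20_000.0, seed=42))
```

The cost here depends on the sample size rather than on the size of the table, which is the trade-off being suggested: a cheaper answer in exchange for an explicit statement of uncertainty.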

Of course, that won't be as good as the real thing, but it still has value in many cases, I
think.  And it can be implemented in a scalable way with some scratch system tables.

I'm happy to take a stab at it if this is of interest to anyone.

> Aggregation functions in CQL
> ----------------------------
>
>                 Key: CASSANDRA-4914
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (wide row of column values of int, double, float, etc.).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc. (for the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
>  
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
