cassandra-commits mailing list archives

From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL
Date Wed, 18 Feb 2015 15:28:15 GMT


Benedict commented on CASSANDRA-4914:

I'm with Cristian here, as I suggested at the NGCC last year. If we want efficient aggregations,
they should absolutely be performed at the replicas. I realise we're not aiming for that first
time around, but IMO it should be the long term goal. Shipping all of your data over the wire
is a pretty significant cost and bottleneck, making the current implementation more of a convenience
than an analytic tool.

Conflict resolution can be handled in a few ways. Probably the best is to first let the user
specify whether they care (CL=ONE is not exactly an uncommon use case; last I heard we reckon
30% of deployments use it, and for analytics queries especially, slight staleness may not be
important). If they do care, perform a repair-aware read from each neighbour to ensure the
replica is up to date. Alternatively, calculate the result optimistically along with a checksum,
and perform the repair if either doesn't match across replicas. Or select the strategy based on
whether the data has been updated recently (say, in the last few minutes): be pessimistic if it
has, and optimistic otherwise.
This is largely what [~tjake]'s Repair Aware Consistency Levels (CASSANDRA-7168) is about.
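
To make the optimistic variant concrete, here is a minimal sketch in Python. It is not how Cassandra does or would implement this: the `Replica` model, the function names and the last-write-wins `reconcile` step (a stand-in for a repair-aware read) are all assumptions made for illustration. For brevity the fallback path aggregates the reconciled data in one place, where a real design would repair the replica and re-run the aggregation locally.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical model of one replica's data: row key -> (write timestamp, value).
Replica = Dict[str, Tuple[int, float]]


def local_sum_and_digest(replica: Replica) -> Tuple[float, str]:
    """Runs at the replica: partial SUM plus a digest of the rows it covered."""
    digest = hashlib.sha256()
    total = 0.0
    for key in sorted(replica):  # fixed order so equal data gives equal digests
        timestamp, value = replica[key]
        digest.update(f"{key}:{timestamp}:{value!r};".encode())
        total += value
    return total, digest.hexdigest()


def reconcile(replicas: List[Replica]) -> Replica:
    """Last-write-wins merge; an assumed stand-in for a repair-aware read."""
    merged: Replica = {}
    for replica in replicas:
        for key, (timestamp, value) in replica.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged


def aggregate_sum(replicas: List[Replica]) -> Tuple[float, str]:
    """Optimistic path if all digests agree, otherwise reconcile and recompute."""
    results = [local_sum_and_digest(r) for r in replicas]
    if len({digest for _, digest in results}) == 1:
        return results[0][0], "optimistic"
    total, _ = local_sum_and_digest(reconcile(replicas))
    return total, "repaired"
```

When the replicas agree, only a number and a digest cross the wire per replica; the expensive path is taken only on a mismatch.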

Generally, analytics queries are intended to be run over large, _mostly_ static datasets,
so the computation should be optimised for this IMO. There is of course the complication of
supporting deterministic aggregations over multiple partitions, which would probably have
to fall back to coordinator-level aggregation for operations that cannot be trivially composed
exactly (e.g. median), but most aggregations can be composed from partial computations trivially.
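
As an illustration of composing aggregations from partial computations, the following Python sketch keeps a small mergeable state per node. The `Partial` class and its field names are hypothetical and not part of any Cassandra API. COUNT, SUM, MIN, MAX and AVG fall out of the merged state, whereas an exact median cannot be recovered from it.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Partial:
    """Mergeable aggregate state that a node could compute locally."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    @staticmethod
    def of(values: Iterable[float]) -> "Partial":
        """Build the partial state for the values held by one node."""
        state = Partial()
        for v in values:
            state = state.merge(Partial(1, v, v, v))
        return state

    def merge(self, other: "Partial") -> "Partial":
        """Combine two partial states; associative and commutative."""
        return Partial(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    def avg(self) -> Optional[float]:
        """AVG is derived from SUM and COUNT, never by averaging averages."""
        return self.total / self.count if self.count else None


# Two nodes each hold part of the salary data from the issue's example.
node_a = Partial.of([10.1, 100.0])
node_b = Partial.of([1000.0])
combined = node_a.merge(node_b)
```

Each node ships four numbers regardless of how many rows it scanned, and the coordinator only has to call `merge`.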

The provision of a sampled approach seems like another excellent idea to me, but an orthogonal
one. The calculation should probably still be offloaded to each node, then combined probabilistically.
This would also support efficient multi-partition queries for all aggregations.
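
One way to picture the sampled approach is the sketch below, again in Python and under stated assumptions: each node keeps a row with probability `rate` and scales its partial SUM by `1 / rate`, and the coordinator adds up the per-node estimates. The function names and the choice of independent per-row sampling are illustrative only; the comment above does not prescribe a sampling scheme.

```python
import random
from typing import Sequence


def sampled_partial_sum(values: Sequence[float], rate: float,
                        rng: random.Random) -> float:
    """Runs at a node: estimate SUM from a random sample of its rows."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    # Keep each row with probability `rate`, then scale up so the
    # estimate is unbiased for the true sum.
    sampled = sum(v for v in values if rng.random() < rate)
    return sampled / rate


def estimate_sum(partitions: Sequence[Sequence[float]], rate: float,
                 seed: int = 0) -> float:
    """Runs at the coordinator: add up the per-node estimates."""
    rng = random.Random(seed)
    return sum(sampled_partial_sum(p, rate, rng) for p in partitions)
```

With `rate=1.0` every row is kept and the result is the exact sum; lower rates trade accuracy for less work per node.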

I'm not saying any of these are trivial undertakings, but they should be what we're aiming for.

> Aggregation functions in CQL
> ----------------------------
>                 Key: CASSANDRA-4914
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
> The requirement is to do aggregation of data in Cassandra (wide row of column values
> of int, double, float, etc.),
> with some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc. (for the columns
> within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130

This message was sent by Atlassian JIRA
