cassandra-commits mailing list archives

From "Benjamin Lerer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10707) Add support for Group By to Select statement
Date Wed, 20 Jan 2016 14:54:40 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108647#comment-15108647 ]

Benjamin Lerer commented on CASSANDRA-10707:
--------------------------------------------

||patch||utests||dtests||
|[trunk|https://github.com/apache/cassandra/compare/trunk...blerer:10707-trunk]|[trunk|http://cassci.datastax.com/view/Dev/view/blerer/job/blerer-10707-trunk-testall/]|[trunk|http://cassci.datastax.com/view/Dev/view/blerer/job/blerer-10707-trunk-dtest/]|

The DTest branch is [here|https://github.com/riptano/cassandra-dtest/pull/753/files]

The classes {{GroupBySpecification}} and {{GroupSelector}} are used to create {{GroupMaker}}
instances. A {{GroupMaker}} is used, on a sorted set of rows, to determine if a row belongs
to the same group as the previous row or not.
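
To illustrate the idea, here is a minimal sketch of a group maker working on sorted rows. It is not the code from the patch: the class {{GroupMakerSketch}}, its methods and the representation of a row as an {{Object[]}} of key columns are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Cassandra GroupMaker: on a sorted input, a row
// starts a new group when its leading key columns differ from the previous row's.
public class GroupMakerSketch {
    private final int prefixSize;   // number of leading key columns used for grouping
    private Object[] lastPrefix;    // prefix of the previous row, null before the first row

    public GroupMakerSketch(int prefixSize) {
        this.prefixSize = prefixSize;
    }

    // Returns true if the row does not belong to the same group as the previous row.
    public boolean isNewGroup(Object[] keyColumns) {
        Object[] prefix = Arrays.copyOf(keyColumns, prefixSize);
        boolean isNew = lastPrefix == null || !Arrays.equals(lastPrefix, prefix);
        lastPrefix = prefix;
        return isNew;
    }

    // Counts the groups found in rows that are already sorted by key.
    public static int countGroups(List<Object[]> sortedRows, int prefixSize) {
        GroupMakerSketch maker = new GroupMakerSketch(prefixSize);
        int groups = 0;
        for (Object[] row : sortedRows) {
            if (maker.isNewGroup(row)) {
                groups++;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[]{"a", 1}, new Object[]{"a", 1},
            new Object[]{"a", 2}, new Object[]{"b", 1});
        System.out.println(countGroups(rows, 1)); // grouping on the first column only
        System.out.println(countGroups(rows, 2)); // grouping on both columns
    }
}
```

Because the input is sorted, comparing each row with the previous one is enough; no map of already seen groups is needed.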

For the moment, only one type of {{GroupSelector}} exists, for primary key columns. Its serialization
mechanism has, nevertheless, been implemented in such a way that it will be possible to add
new implementations (to allow the use of functions in the {{GROUP BY}} clause, for example)
without breaking backward compatibility.

{{SelectStatement}} and {{Selection}} have been modified in order to use {{GroupBySpecification}}
and {{GroupMaker}} when building the result set.

Group by queries are always paged internally to avoid {{OOMExceptions}}. Two new {{DataLimits}}
have been added to manage the group by paging: {{CQLGroupByLimits}} and {{CQLGroupByPagingLimits}}.
They keep track of the group count and of the row count to make sure that the processing is
stopped as soon as one of the limits is reached.

A group is only counted once the next one is reached, as a group can be spread over multiple
pages. The problem with this approach is that a counter can only know that it has reached the
group limit when it has reached a row that should not be added to the result set. As multiple
counters are used when a request is processed, the extra row is not filtered out until it reaches
the counter of the {{QueryPager}}. To do that, a special factory method has been added to {{DataLimits}}:
{{forPagingByQueryPager(int pageSize)}}.
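
The counting rule can be sketched as follows. This is a simplified illustration and not the {{DataLimits}} code from the patch: {{GroupLimitSketch}} and {{limitGroups}} are hypothetical names, rows are reduced to their group key, and both limits are assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a group is counted only when the first row of the next
// group is seen, so the counter always reads one row more than it returns.
public class GroupLimitSketch {

    // Returns the rows of at most groupLimit groups and at most rowLimit rows,
    // reading from an input that is already sorted by group key.
    public static List<String> limitGroups(List<String> sortedKeys, int groupLimit, int rowLimit) {
        List<String> result = new ArrayList<>();
        int completedGroups = 0;
        String current = null;
        for (String key : sortedKeys) {
            if (current != null && !current.equals(key)) {
                completedGroups++;          // the previous group is only known to be complete now
                if (completedGroups == groupLimit) {
                    break;                  // 'key' is the extra row: seen but not returned
                }
            }
            if (result.size() == rowLimit) {
                break;                      // the row limit stops processing as well
            }
            current = key;
            result.add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "a", "b", "c", "c");
        System.out.println(limitGroups(keys, 2, 10)); // stops on the first "c" row
        System.out.println(limitGroups(keys, 5, 2));  // stops on the row limit
    }
}
```

If several such counters were stacked, each one would need to let the extra row through to the next, which is why only the last counter should drop it.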

This approach did not work properly in the case of the {{MultiPartitionPager}}, as an extra
counter was added on top of the one of the {{SinglePartitionPager}}. To solve that problem,
the use of the counter in {{MultiPartitionPager}} has been replaced by another mechanism.

The internal paging is performed by the {{GroupByQueryPager}}, which automatically fetches new
pages of data when needed.
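
As a rough illustration of what fetching pages on demand means, the sketch below exposes a single iterator that transparently loads the next page when the current one is exhausted. It does not reflect the {{GroupByQueryPager}} implementation: {{InternalPagerSketch}} is a hypothetical class, the in-memory list stands in for the storage layer, and the page size is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: the caller iterates over all rows while the pager
// fetches one bounded page at a time behind the scenes.
public class InternalPagerSketch implements Iterator<Integer> {
    private final List<Integer> data;   // stands in for the underlying storage
    private final int pageSize;         // assumed to be greater than zero
    private int offset = 0;             // position of the next page in the data
    private int fetches = 0;            // number of pages fetched so far
    private Iterator<Integer> page = Collections.emptyIterator();

    public InternalPagerSketch(List<Integer> data, int pageSize) {
        this.data = data;
        this.pageSize = pageSize;
    }

    // Loads the next page only when the current one is exhausted.
    private void fetchIfNeeded() {
        if (!page.hasNext() && offset < data.size()) {
            int end = Math.min(offset + pageSize, data.size());
            page = new ArrayList<>(data.subList(offset, end)).iterator();
            offset = end;
            fetches++;
        }
    }

    @Override
    public boolean hasNext() {
        fetchIfNeeded();
        return page.hasNext();
    }

    @Override
    public Integer next() {
        fetchIfNeeded();
        if (!page.hasNext()) {
            throw new NoSuchElementException();
        }
        return page.next();
    }

    public int fetches() {
        return fetches;
    }
}
```

Only one page is held in memory at any time, which is the property that protects against running out of memory on large groups.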

As the {{DataLimits}} need to be updated for each new internal query and the {{ReadQuery}}
classes are immutable, a new {{withUpdatedLimit}} method has been added to all the {{ReadQuery}}
classes.
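
The usual way to "update" an immutable object is to return a modified copy. The sketch below shows that pattern only; {{ReadQuerySketch}} is a hypothetical class, and the limit is simplified to an {{int}} where the real method works with {{DataLimits}}.

```java
// Hypothetical sketch of the copy-on-update pattern for an immutable query.
public final class ReadQuerySketch {
    private final String table;
    private final int limit;    // simplification: a plain int instead of DataLimits

    public ReadQuerySketch(String table, int limit) {
        this.table = table;
        this.limit = limit;
    }

    // Returns a new query with the given limit; the original is left untouched.
    public ReadQuerySketch withUpdatedLimit(int newLimit) {
        return new ReadQuerySketch(table, newLimit);
    }

    public String table() {
        return table;
    }

    public int limit() {
        return limit;
    }

    public static void main(String[] args) {
        ReadQuerySketch first = new ReadQuerySketch("myTable", 100);
        ReadQuerySketch next = first.withUpdatedLimit(40);
        System.out.println(first.limit() + " " + next.limit());
    }
}
```

Each internal query can then be derived from the previous one with the remaining limit, without any shared mutable state.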

In order to simplify the {{SelectStatement}} code, the patch also slightly modifies the way
queries with aggregates but no {{GROUP BY}} work.
I implemented it initially on top of the group by paging but realized afterward that it was
breaking backward compatibility. We will be able to switch back to it in the future, once we
are sure that the group by paging is supported by the previous versions.


> Add support for Group By to Select statement
> --------------------------------------------
>
>                 Key: CASSANDRA-10707
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>
> Now that Cassandra supports aggregate functions, it makes sense to support {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP BY partitionKey, clustering0, clustering1;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
