cassandra-commits mailing list archives

From "Benjamin Lerer (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8940) Inconsistent select count and select distinct
Date Tue, 28 Apr 2015 12:59:06 GMT


Benjamin Lerer commented on CASSANDRA-8940:

{quote}Thanks for the update. I guess you are on to something. Again, if there's anything
I can help with, I'm happy to pitch in.{quote}

Thanks for the offer :-). For the moment, I am just digging.

{quote}(a bit off topic): I wasn't aware that Cassandra performs the count on the coordinator. I wonder
why one couldn't push the count operator to the replicas involved. I see that aggregate functions
in Cassandra trunk are implemented in a similar fashion. A pity if you ask me.{quote}

The advantage of this approach was that the consistency problem was already solved: the coordinator
was guaranteed to have the latest data.
The plan was to deliver that initial version first and to make it better in the future. If
you are interested in it, you can follow CASSANDRA-8826. 
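
To illustrate the trade-off, here is a minimal toy sketch in Python. It is not Cassandra code: the replica model (a map from key to write timestamp and a "live" flag), the last-write-wins merge, and all function names are assumptions made only for this example. It shows why counting after the coordinator has reconciled the replica responses avoids the consistency question, while naively summing per-replica counts does not.

```python
from typing import Dict, List, Tuple

# Hypothetical toy model: primary key -> (write timestamp, row is live)
Replica = Dict[str, Tuple[int, bool]]


def count_on_coordinator(replicas: List[Replica]) -> int:
    """Merge all replica responses (newest timestamp wins), then count live rows."""
    merged: Dict[str, Tuple[int, bool]] = {}
    for replica in replicas:
        for key, (timestamp, live) in replica.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, live)
    return sum(1 for _, live in merged.values() if live)


def naive_pushed_down_count(replicas: List[Replica]) -> int:
    """Each replica counts its own live rows; the coordinator only sums the counts."""
    return sum(
        sum(1 for _, live in replica.values() if live) for replica in replicas
    )


# replica_b has seen a later delete of k2 that replica_a missed.
replica_a: Replica = {"k1": (1, True), "k2": (1, True)}
replica_b: Replica = {"k1": (1, True), "k2": (2, False)}

coordinator_count = count_on_coordinator([replica_a, replica_b])
pushed_down_count = naive_pushed_down_count([replica_a, replica_b])
```

In this toy model the coordinator-side count sees only the reconciled rows, whereas the naive sum counts k1 twice and still counts the stale copy of k2. A real push-down would need some way to reconcile replicas before or while aggregating.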

{quote}As I understand it, select count queries operate on top of normal select all queries. Does
this mean that this 'skipping' of rows might also be a problem in other cases? Or is it only
a problem because the result set is processed/paged on a Cassandra node and not in a driver?{quote}

The 'skipping' of rows might apparently be a problem for queries requesting data from more
than one partition. I do not know yet the extent of the problem.

> Inconsistent select count and select distinct
> ---------------------------------------------
>                 Key: CASSANDRA-8940
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 2.1.2
>            Reporter: Frens Jan Rumph
>            Assignee: Benjamin Lerer
>         Attachments: 7b74fb00-e935-11e4-b10c-317579db7eb4.csv, 8d5899d0-e935-11e4-847b-2d06da75a6cd.csv,
> When performing {{select count( * ) from ...}} I expect the results to be consistent
over multiple query executions if the table at hand is not written to / deleted from in the
meantime. However, in my set-up they are not. The counts returned vary considerably (several
percent). The same holds for {{select distinct partition-key-columns from ...}}.
> I have a table in a keyspace with replication_factor = 1 which is something like:
> {code}
>     id frozen<id_type>,
>     bucket bigint,
>     offset int,
>     value double,
>     PRIMARY KEY ((id, bucket), offset)
> )
> {code}
> The frozen udt is:
> {code}
> CREATE TYPE id_type (
>     tags map<text, text>
> );
> {code}
> The table contains around 35k rows (I'm not trying to be funny here ...). The consistency
level for the queries was ONE.

This message was sent by Atlassian JIRA
