cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Catalin Alexandru Zamfir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8131) Short-circuited query results from collection index query
Date Fri, 17 Oct 2014 12:57:34 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175030#comment-14175030
] 

Catalin Alexandru Zamfir commented on CASSANDRA-8131:
-----------------------------------------------------

True. I was threw off by the "CONTAINS" which I interpreted as a search, basically linking
to "searches" (CONTAINS) in the same result set (at least in my mind). Cassandra does not
support OR to reach the goal we're trying to achieve.

Instead, I've tried this, which I guess it's already fixed (should have returned one row):
{noformat}
insert into by_sets (id, datakeys, datavars) values (5, {'a', 'c'}, {'q'});
select * from by_sets WHERE datakeys CONTAINS 'a' AND datakeys CONTAINS 'c' ;

 id | datakeys   | datavars
----+------------+----------
  5 | {'a', 'c'} |    {'q'}
  1 |      {'a'} |    {'b'}
  4 |      {'a'} |    {'z'}

(3 rows)
{noformat}

> Short-circuited query results from collection index query
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-8131
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8131
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Debian Wheezy, Oracle JDK, Cassandra 2.1
>            Reporter: Catalin Alexandru Zamfir
>            Assignee: Benjamin Lerer
>              Labels: collections, cql3, cqlsh, query, queryparser, triaged
>             Fix For: 2.1.0
>
>
> After watching Jonathan's 2014 summit video, I wanted to give collection indexes a try
as they seem to be a fit for a "search by key/values" usage pattern we have in our setup.
Doing some test queries that I expect users would do against the table, a short-circuit behavior
came up:
> Here's the whole transcript:
> {noformat}
> CREATE TABLE by_sets (id int PRIMARY KEY, datakeys set<text>, datavars set<text>);
> CREATE INDEX by_sets_datakeys ON by_sets (datakeys);
> CREATE INDEX by_sets_datavars ON by_sets (datavars);
> INSERT INTO by_sets (id, datakeys, datavars) VALUES (1, {'a'}, {'b'});
> INSERT INTO by_sets (id, datakeys, datavars) VALUES (2, {'c'}, {'d'});
> INSERT INTO by_sets (id, datakeys, datavars) VALUES (3, {'e'}, {'f'});
> INSERT INTO by_sets (id, datakeys, datavars) VALUES (4, {'a'}, {'z'});
> SELECT * FROM by_sets;
>  id | datakeys | datavars
> ----+----------+----------
>   1 |    {'a'} |    {'b'}
>   2 |    {'c'} |    {'d'}
>   4 |    {'a'} |    {'z'}
>   3 |    {'e'} |    {'f'}
> {noformat}
> We then tried this query which short-circuited:
> {noformat}
> SELECT * FROM by_sets WHERE datakeys CONTAINS 'a' AND datakeys CONTAINS 'c';
>  id | datakeys | datavars
> ----+----------+----------
>   1 |    {'a'} |    {'b'}
>   4 |    {'a'} |    {'z'}
> (2 rows)
> {noformat}
> Instead of receveing 3 rows, which match the datakeys CONTAINS 'a' AND datakeys CONTAINS
'c' we only got the first.
> Doing the same, but with CONTAINS 'c' first, ignores the second AND.
> {noformat}
> SELECT * FROM by_sets WHERE datakeys CONTAINS 'c' AND datakeys CONTAINS 'a' ;
>  id | datakeys | datavars
> ----+----------+----------
>   2 |    {'c'} |    {'d'}
> (1 rows)
> {noformat}
> Also, on a side-note, I have two indexes on both datakeys and datavars. But when trying
to run a query such as:
> {noformat}
> select * from by_sets WHERE datakeys CONTAINS 'a' AND datavars CONTAINS 'z';
> code=2200 [Invalid query] message="Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. 
> If you want to execute this query despite the performance unpredictability, use ALLOW
FILTERING"
> {noformat}
> The second column, after AND (even if I inverse the order) requires an "allow filtering"
clause yet the column is indexed an an in-memory "join" of the primary keys of these sets
on the coordinator could build up the result.
> Could anyone explain the short-circuit behavior?
> And the requirement for "allow-filtering" on a secondly indexed column?
> If they're not bugs but intended they should be documented better, at least their limitations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message