cassandra-commits mailing list archives

From "Alex Petrov (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (CASSANDRA-12674) [SASI] Confusing AND/OR semantics for StandardAnalyzer
Date Tue, 21 Mar 2017 14:33:41 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-12674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov reassigned CASSANDRA-12674:
---------------------------------------

    Assignee: Alex Petrov

> [SASI] Confusing AND/OR semantics for StandardAnalyzer 
> -------------------------------------------------------
>
>                 Key: CASSANDRA-12674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12674
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>         Environment: Cassandra 3.7
>            Reporter: DOAN DuyHai
>            Assignee: Alex Petrov
>
> {code:sql}
> Connected to Test Cluster at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
> Use HELP for help.
> cqlsh> use test;
> cqlsh:test> CREATE TABLE sasi_bug(id int, clustering int, val text, PRIMARY KEY((id), clustering));
> cqlsh:test> CREATE CUSTOM INDEX ON sasi_bug(val) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>     'mode': 'CONTAINS',
>     'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>     'analyzed': 'true'};
> //1st example SAME PARTITION KEY
> cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 1, 'homeworker');
> cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 2, 'hardworker');
> cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%work home%';
>  id | clustering | val
> ----+------------+------------
>   1 |          1 | homeworker
>   1 |          2 | hardworker
> (2 rows)
> //2nd example DIFFERENT PARTITION KEY
> cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(10, 1, 'speedrun');
> cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(11, 1, 'longrun');
> cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%long run%';
>  id | clustering | val
> ----+------------+---------
>  11 |          1 | longrun
> (1 rows)
> {code}
> In the 1st example, both rows belong to the same partition, so SASI returns both values. Indeed, {{LIKE '%work home%'}} means {{contains 'work' OR 'home'}}, so the result makes sense.
> In the 2nd example, only one row is returned, whereas we expect 2 rows: {{LIKE '%long run%'}} means {{contains 'long' OR 'run'}}, so *speedrun* should be returned too.
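
To make the expected behaviour concrete, here is a minimal sketch of the *OR* semantics described above. It is not SASI code: the class and method names are hypothetical, and splitting the search string on whitespace is only a simplifying stand-in for what {{StandardAnalyzer}} does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Cassandra.
class ExpectedLikeSemantics
{
    // Assumption: the analyzer yields one term per whitespace-separated word.
    static List<String> terms(String search)
    {
        return Arrays.asList(search.trim().toLowerCase().split("\\s+"));
    }

    // OR semantics: a value matches if it contains at least one of the terms.
    static boolean matchesAny(String value, String search)
    {
        String lowered = value.toLowerCase();
        for (String term : terms(search))
            if (lowered.contains(term))
                return true;
        return false;
    }

    // Keeps every value that matches, regardless of which partition it lives in.
    static List<String> select(List<String> values, String search)
    {
        List<String> result = new ArrayList<>();
        for (String value : values)
            if (matchesAny(value, search))
                result.add(value);
        return result;
    }

    public static void main(String[] args)
    {
        System.out.println(select(Arrays.asList("homeworker", "hardworker"), "work home"));
        System.out.println(select(Arrays.asList("speedrun", "longrun"), "long run"));
    }
}
```

Under these assumptions both queries return both rows, which is what the 2nd example above fails to do.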
> So where is the problem? Explanation:
> When there is only 1 predicate, the root operation type is an *AND*:
> {code:java|title=QueryPlan}
>     private Operation analyze()
>     {
>         try
>         {
>             Operation.Builder and = new Operation.Builder(OperationType.AND, controller);
>             controller.getExpressions().forEach(and::add);
>             return and.complete();
>         }
>        ...
> }
> {code}
> During the parsing of {{LIKE '%long run%'}}, SASI creates 2 expressions for the searched term, {{long}} and {{run}}, which corresponds to *OR* logic. However, this piece of code discards the *OR* logic:
> {code:java|title=Operation}
>         public Operation complete()
>         {
>             if (!expressions.isEmpty())
>             {
>                 ListMultimap<ColumnDefinition, Expression> analyzedExpressions = analyzeGroup(controller, op, expressions);
>                 RangeIterator.Builder<Long, Token> range = controller.getIndexes(op, analyzedExpressions.values());
>      ...
> }
> {code}
> As you can see, we blindly take all the *values* of the MultiMap (which contains a single entry for the {{val}} column with 2 expressions) and pass them to {{controller.getIndexes(...)}}:
> {code:java|title=QueryController}
>     public RangeIterator.Builder<Long, Token> getIndexes(OperationType op, Collection<Expression> expressions)
>     {
>         if (resources.containsKey(expressions))
>             throw new IllegalArgumentException("Can't process the same expressions multiple times.");
>         RangeIterator.Builder<Long, Token> builder = op == OperationType.OR
>                                                 ? RangeUnionIterator.<Long, Token>builder()
>                                                 : RangeIntersectionIterator.<Long, Token>builder();
>         ...
> }
> {code}
> And because the root operation has the *AND* type, the {{RangeIntersectionIterator}} is used on both expressions, {{long}} and {{run}}.
> So when the data belong to different partitions, the *AND* logic applies and eliminates _speedrun_.
> When the data belong to the same partition but to different rows, the {{RangeIntersectionIterator}} returns a single partition; the rows are then filtered further by {{operationTree.satisfiedBy}}, and the results are correct:
> {code:java|title=QueryPlan}
>                 while (currentKeys.hasNext())
>                 {
>                     DecoratedKey key = currentKeys.next();
>                     if (!keyRange.right.isMinimum() && keyRange.right.compareTo(key) < 0)
>                         return endOfData();
>                     try (UnfilteredRowIterator partition = controller.getPartition(key, executionController))
>                     {
>                     {
>                         Row staticRow = partition.staticRow();
>                         List<Unfiltered> clusters = new ArrayList<>();
>                         while (partition.hasNext())
>                         {
>                             Unfiltered row = partition.next();
>                             if (operationTree.satisfiedBy(row, staticRow, true))
>                                 clusters.add(row);
>                         }
>  ...
> }
> {code}
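
The two-phase behaviour described above can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions, not Cassandra code: every name in it is hypothetical, partition keys stand in for the {{Long}} tokens that the real iterators work on, and the row-level filter is assumed to accept a row that contains any of the analyzed terms (which is what the output of the 1st example suggests).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the behaviour described in this issue; not SASI code.
class SasiAndOrSketch
{
    static final class Row
    {
        final int id;          // partition key
        final int clustering;
        final String val;

        Row(int id, int clustering, String val)
        {
            this.id = id;
            this.clustering = clustering;
            this.val = val;
        }
    }

    // The four rows inserted in the cqlsh session above.
    static List<Row> sampleRows()
    {
        return Arrays.asList(new Row(1, 1, "homeworker"), new Row(1, 2, "hardworker"),
                             new Row(10, 1, "speedrun"), new Row(11, 1, "longrun"));
    }

    // Phase 1a: per-term "index lookup" returning the partitions that hold a matching row.
    static Set<Integer> partitionsFor(List<Row> rows, String term)
    {
        Set<Integer> ids = new TreeSet<>();
        for (Row row : rows)
            if (row.val.contains(term))
                ids.add(row.id);
        return ids;
    }

    // Phase 1b: combine the per-term results by intersection (AND) or union (OR).
    static Set<Integer> candidates(List<Row> rows, List<String> terms, boolean intersect)
    {
        Set<Integer> result = null;
        for (String term : terms)
        {
            Set<Integer> ids = partitionsFor(rows, term);
            if (result == null)
                result = ids;
            else if (intersect)
                result.retainAll(ids);
            else
                result.addAll(ids);
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    // Phase 2: read each candidate partition and keep rows containing any term
    // (assumed stand-in for the row-level filtering step).
    static List<String> query(List<Row> rows, List<String> terms, boolean intersect)
    {
        Set<Integer> partitions = candidates(rows, terms, intersect);
        List<String> matched = new ArrayList<>();
        for (Row row : rows)
        {
            if (!partitions.contains(row.id))
                continue;
            for (String term : terms)
            {
                if (row.val.contains(term))
                {
                    matched.add(row.val);
                    break;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args)
    {
        List<Row> rows = sampleRows();
        System.out.println(query(rows, Arrays.asList("work", "home"), true));  // intersection
        System.out.println(query(rows, Arrays.asList("long", "run"), true));   // intersection
        System.out.println(query(rows, Arrays.asList("long", "run"), false));  // union
    }
}
```

In this model, intersecting the per-term partition sets keeps partition 1 for {{work}}/{{home}} (both rows survive the row filter) but only partition 11 for {{long}}/{{run}}, so _speedrun_ is lost before the row filter ever sees it. Switching to a union restores it.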
> /cc [~xedin] [~ifesdjeen]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
