cassandra-commits mailing list archives

From "DOAN DuyHai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11130) [SASI Pre-QA] = semantics not respected when using StandardAnalyzer
Date Sun, 07 Feb 2016 21:15:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136448#comment-15136448
] 

DOAN DuyHai commented on CASSANDRA-11130:
-----------------------------------------

I've thought about one possible way to provide strict {{=}} semantics when using StandardAnalyzer.

 On the SASI side, you would still hit disk to fetch all matching terms, but then apply a
post-processing step that returns only exact matches.

 I don't know whether you store the source column value in the SASI index or not. If you do,
it should be easy. If not, it will be expensive, because we'll have to hit the Cassandra SSTables
before being able to filter out non-exact matches.
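
 To make the idea concrete, here is a minimal sketch of the post-processing step, written in
Python purely for illustration. Nothing in it is SASI code: {{analyze}}, {{index_lookup}},
{{exact_match_filter}} and the stop-word list are hypothetical stand-ins, and the toy analyzer
only lowercases, splits on whitespace and drops stop words (no stemming). It also assumes the
source column value is available when filtering, which is exactly the open question above.

```python
# Hypothetical sketch of "index lookup, then exact-match post-filter".
# None of these names are SASI APIs; the analyzer is a toy stand-in.

STOP_WORDS = {"so", "the", "a"}  # assumed stop-word list, for illustration only


def analyze(text):
    """Toy tokenizing analyzer: lowercase, split on whitespace, drop stop words."""
    return [term for term in text.lower().split() if term not in STOP_WORDS]


def index_lookup(rows, column, query):
    """Simulate the index lookup: keep rows sharing at least one analyzed term with the query."""
    query_terms = set(analyze(query))
    return [row for row in rows if query_terms & set(analyze(row[column]))]


def exact_match_filter(candidates, column, query):
    """Post-processing: keep only rows whose source column value equals the query."""
    return [row for row in candidates if row[column] == query]


ROWS = [
    {"id": 1, "artist": "Superpitcher", "title1": "Yesterday"},
    {"id": 2, "artist": "Hilary Duff", "title1": "So Yesterday"},
    {"id": 3, "artist": "The Mr. T Experience", "title1": "Yesterday Rules"},
]

# Step 1: the tokenized lookup returns every row containing the term.
candidates = index_lookup(ROWS, "title1", "Yesterday")

# Step 2: the post-filter narrows the candidates down to the exact match.
exact = exact_match_filter(candidates, "title1", "Yesterday")
```

 The filter here uses strict string equality. Whether {{=}} should also fold case when the index
is created with {{'case_sensitive': 'false'}} is a separate decision that this sketch does not
try to settle.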

> [SASI Pre-QA] = semantics not respected when using StandardAnalyzer
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-11130
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11130
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: Tested from build [CASSANDRA-11067|https://issues.apache.org/jira/browse/CASSANDRA-11067]
>            Reporter: DOAN DuyHai
>            Assignee: Pavel Yaskevich
>
> Tested from build [CASSANDRA-11067|https://issues.apache.org/jira/browse/CASSANDRA-11067]
> {code:sql}
> CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
>     AND durable_writes = true;
> CREATE TABLE music.albums (
>     id int PRIMARY KEY,
>     artist text,
>     title1 text,
>     title2 text
> );
> CREATE CUSTOM INDEX ON music.albums (title1) USING 'org.apache.cassandra.index.sasi.SASIIndex'
>     WITH OPTIONS = {'tokenization_skip_stop_words': 'true', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>     'case_sensitive': 'false', 'mode': 'PREFIX', 'tokenization_enable_stemming': 'true'};
> CREATE CUSTOM INDEX ON music.albums (title2) USING 'org.apache.cassandra.index.sasi.SASIIndex'
>     WITH OPTIONS = {'tokenization_skip_stop_words': 'true', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>     'case_sensitive': 'false', 'mode': 'CONTAINS', 'tokenization_enable_stemming': 'true'};
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(1, 'Superpitcher', 'Yesterday', 'Yesterday');
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(2, 'Hilary Duff', 'So Yesterday', 'So Yesterday');
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(3, 'The Mr. T Experience', 'Yesterday Rules', 'Yesterday Rules');
> SELECT artist,title1 FROM music.albums WHERE title1='Yesterday';
>  artist                 | title1
> ------------------------+----------------
>            Superpitcher |       Yesterday
>             Hilary Duff |    So Yesterday
>    The Mr. T Experience | Yesterday Rules
>  
> (3 rows)
> SELECT artist,title1 FROM music.albums WHERE title2='Yesterday';
>  artist                 | title1
> ------------------------+----------------
>            Superpitcher |       Yesterday
>             Hilary Duff |    So Yesterday
>    The Mr. T Experience | Yesterday Rules
>   
> (3 rows)
> {code}
> The semantics of *=* are not respected. SASI should return only 1 row, the exact match.
> Using *LIKE* would return all 3 rows. This affects both *PREFIX* and *CONTAINS* modes. Using
> *NonTokenizingAnalyzer* returns 1 row, the exact match.
> So the semantics of *=* depend on the chosen analyzer, which is inconsistent.
> We should force *=* to be an exact match no matter which analyzer is chosen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
