cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pavel Yaskevich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results
Date Mon, 27 Jun 2016 23:51:57 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352073#comment-15352073
] 

Pavel Yaskevich commented on CASSANDRA-12073:
---------------------------------------------

[~doanduyhai] The changes look good to me, because in case of prefix queries we should "step
over" partial terms (because they are essentially invisible for the expression in PREFIX mode)
until upper bound tells us to stop instead of trying to stop right when partial has been encountered.


Couple of comments regarding code: since partialTermAndPrefixOperation is only used at one
place I would rather have it as inline if with the comment on top e.g. 
{noformat}
// we need to step over all of the partial terms, in PREFIX mode,
// encountered by the query until upper-bound tells us to stop
if (operation == Op.PREFIX && term.isPartial()) 
   continue;

// haven't reached the start of the query range yet, let's 
// keep skip the current term until lower bound is satisfied
if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
   continue;
{noformat}

Also I think it essential to have a test-case in OnDiskIndexTest since it gives you more control
over index blocks and internals than SASIndexTest.

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12073
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: Cassandra 3.7
>            Reporter: DOAN DuyHai
>            Assignee: DOAN DuyHai
>             Fix For: 3.x
>
>         Attachments: patch_PREFIX_search_with_CONTAINS_mode.txt
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
>     id uuid PRIMARY KEY,
>     artist text,
>     country text,
>     quality text,
>     status text,
>     title text,
>     year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) USING
'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class':
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id                                   | artist    | country        | quality | status
   | title                     | year
> --------------------------------------+-----------+----------------+---------+-----------+---------------------------+------
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |            USA |  normal |  Official
|           Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |        Unknown |  normal | Promotion
|                Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |            USA |  normal |  Official
|   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |        Unknown |  normal |  Official
|                Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |            USA |  normal |  Official
|   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |          Italy |  normal |  Official
|                  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |            USA |  normal |  Official
|            Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |      Australia |  normal |  Official
|                     Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |            USA |  normal |  Official
| Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |        Unknown |  normal |      null
|                Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal |  Official
|                Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
>             {
>                 if (currentBlock == null)
>                     return endOfData();
>                 if (offset >= 0 && offset < currentBlock.termCount())
>                 {
>                     DataTerm currentTerm = currentBlock.getTerm(nextOffset());
>                     if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
>                         continue;
>                     // flip the flag right on the first bounds match
>                     // to avoid expensive comparisons
>                     checkLower = false;
>                     if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
>                         return endOfData();
>                     return currentTerm;
>                 }
>                 nextBlock();
>             }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether the term match
but also returns *false* when it's a *partial term* !
> {code:java}
>     public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
>     {
>         if (!hasUpper())
>             return true;
>         if (nonMatchingPartial(term))
>             return false;
>         int cmp = term.compareTo(validator, upper.value, false);
>         return cmp < 0 || cmp == 0 && upper.inclusive;
>     }
> {code}
> By debugging the OnDiskIndex data, I've found:
> {noformat}
> ...
> Data Term (partial ? false) : lady gaga. 0x0, TokenTree offset : 21120
> Data Term (partial ? true) : lady of bells. 0x0, TokenTree offset : 21360
> Data Term (partial ? false) : lady pank. 0x0, TokenTree offset : 21440
> ...
> {noformat}
> *lady gaga* did match but then *lady of bells* fails because it is a partial match. Then
SASI returns {{endOfData()}} instead continuing to check *lady pank*
> One quick & dirty fix I've found is to add {{!e.nonMatchingPartial(currentTerm)}}
to the *if* condition
> {code:java}
>  if (checkUpper && !e.isUpperSatisfiedBy(currentTerm) && !e.nonMatchingPartial(currentTerm))
>                         return endOfData();
> {code} 
> I feel it kinda dirty and I suspect this bug to has more impact than the *PREFIX* case
> Wdyt [~xedin] [~beobal] [~jrwest] ? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message