cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "DOAN DuyHai (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results
Date Sun, 26 Jun 2016 13:33:33 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350113#comment-15350113
] 

DOAN DuyHai edited comment on CASSANDRA-12073 at 6/26/16 1:32 PM:
------------------------------------------------------------------

Ok I've dug further into the code and the bug has wider implication

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
          return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
    currentBlock = null;

    if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
        return;

    currentBlock = dataLevel.getBlock(nextBlockIndex());
    offset = checkLower ? order.startAt(currentBlock, e) : currentBlock.minOffset(order);

    // let's check the last term of the new block right away
    // if expression's upper bound is satisfied by it such means that we can avoid
    // doing any expensive upper bound checks for that block.
    checkUpper = e.hasUpper() &&
    !e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire next block
will be skipped incorrectly

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current {{e::isLowerSatisfiedBy}}
and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to {{Expression::partialTermAndPrefixOperation}}
which describes exactly which condition it checks, much clearer name from my pov. Also make
the method access *public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current term and process
the next one

{code:java}
                if (offset >= 0 && offset < currentBlock.termCount())
                {
                    DataTerm currentTerm = currentBlock.getTerm(nextOffset());

                    if (e.partialTermAndPrefixOperation(currentTerm) || 
                        (checkLower && !e.isLowerSatisfiedBy(currentTerm)))
                        continue;

                    // flip the flag right on the first bounds match
                    // to avoid expensive comparisons
                    checkLower = false;

                    if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
                        return endOfData();

                    return currentTerm;
                }
{code}

I'll also add an extra unit-test to cover the premature {{endOfData()}} case. To test the
incorrect next block skipping we'll need to generate enough data so that the last term of
next block is partial. I'm not sure how feasible it is.

What do you think @xedin, [~jrwest] ?


was (Author: doanduyhai):
Ok I've dug further into the code and the bug has wider implication

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
          return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
    currentBlock = null;

    if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
        return;

    currentBlock = dataLevel.getBlock(nextBlockIndex());
    offset = checkLower ? order.startAt(currentBlock, e) : currentBlock.minOffset(order);

    // let's check the last term of the new block right away
    // if expression's upper bound is satisfied by it such means that we can avoid
    // doing any expensive upper bound checks for that block.
    checkUpper = e.hasUpper() &&
    !e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire next block
will be skipped 

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current {{e::isLowerSatisfiedBy}}
and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to {{Expression::partialTermAndPrefixOperation}}
which describes exactly which condition it checks, much clearer name from my pov. Also make
the method access *public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current term and process
the next one

{code:java}
                if (offset >= 0 && offset < currentBlock.termCount())
                {
                    DataTerm currentTerm = currentBlock.getTerm(nextOffset());

                    if (e.partialTermAndPrefixOperation(currentTerm) || 
                        (checkLower && !e.isLowerSatisfiedBy(currentTerm)))
                        continue;

                    // flip the flag right on the first bounds match
                    // to avoid expensive comparisons
                    checkLower = false;

                    if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
                        return endOfData();

                    return currentTerm;
                }
{code}

I'll also add an extra unit-test to cover the premature {{endOfData()}} case. To test the
incorrect next block skipping we'll need to generate enough data so that the last term of
next block is partial. I'm not sure how feasible it is.

What do you think @xedin, [~jrwest] ?

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12073
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: Cassandra 3.7
>            Reporter: DOAN DuyHai
>            Assignee: DOAN DuyHai
>             Fix For: 3.x
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
>     id uuid PRIMARY KEY,
>     artist text,
>     country text,
>     quality text,
>     status text,
>     title text,
>     year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) USING
'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class':
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id                                   | artist    | country        | quality | status
   | title                     | year
> --------------------------------------+-----------+----------------+---------+-----------+---------------------------+------
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |            USA |  normal |  Official
|           Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |        Unknown |  normal | Promotion
|                Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |            USA |  normal |  Official
|   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |        Unknown |  normal |  Official
|                Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |            USA |  normal |  Official
|   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |          Italy |  normal |  Official
|                  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |            USA |  normal |  Official
|            Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |      Australia |  normal |  Official
|                     Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |            USA |  normal |  Official
| Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |        Unknown |  normal |      null
|                Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal |  Official
|                Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
>             {
>                 if (currentBlock == null)
>                     return endOfData();
>                 if (offset >= 0 && offset < currentBlock.termCount())
>                 {
>                     DataTerm currentTerm = currentBlock.getTerm(nextOffset());
>                     if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
>                         continue;
>                     // flip the flag right on the first bounds match
>                     // to avoid expensive comparisons
>                     checkLower = false;
>                     if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
>                         return endOfData();
>                     return currentTerm;
>                 }
>                 nextBlock();
>             }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether the term match
but also returns *false* when it's a *partial term* !
> {code:java}
>     public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
>     {
>         if (!hasUpper())
>             return true;
>         if (nonMatchingPartial(term))
>             return false;
>         int cmp = term.compareTo(validator, upper.value, false);
>         return cmp < 0 || cmp == 0 && upper.inclusive;
>     }
> {code}
> By debugging the OnDiskIndex data, I've found:
> {noformat}
> ...
> Data Term (partial ? false) : lady gaga. 0x0, TokenTree offset : 21120
> Data Term (partial ? true) : lady of bells. 0x0, TokenTree offset : 21360
> Data Term (partial ? false) : lady pank. 0x0, TokenTree offset : 21440
> ...
> {noformat}
> *lady gaga* did match but then *lady of bells* fails because it is a partial match. Then
SASI returns {{endOfData()}} instead continuing to check *lady pank*
> One quick & dirty fix I've found is to add {{!e.nonMatchingPartial(currentTerm)}}
to the *if* condition
> {code:java}
>  if (checkUpper && !e.isUpperSatisfiedBy(currentTerm) && !e.nonMatchingPartial(currentTerm))
>                         return endOfData();
> {code} 
> I feel it kinda dirty and I suspect this bug to has more impact than the *PREFIX* case
> Wdyt [~xedin] [~beobal] [~jrwest] ? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message