cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From DuyHai Doan <doanduy...@gmail.com>
Subject Re: How to query '%' character using LIKE operator in Cassandra 3.7?
Date Thu, 15 Sep 2016 14:11:07 GMT
Currently SASI can only understand the % for the beginning (suffix) or
ending (prefix) position.

Any expression containing the % in the middle like %w%a% will not be
interpreter by SASI as wildcard.

%w%a% will translate into "Give me all results containing w%a

On Thu, Sep 15, 2016 at 3:58 PM, Mikhail Krupitskiy <
mikhail.krupitskiy@jetbrains.com> wrote:

> Thank you for the investigation. Will wait for a fix and news.
>
> Probably it’s not a directly related question but what do you think about
> CASSANDRA-12573? Let me know if it’s better to create a separate thread for
> it.
>
> Thanks,
> Mikhail
>
>
> On 15 Sep 2016, at 16:02, DuyHai Doan <doanduyhai@gmail.com> wrote:
>
> Ok so I've found the source of the issue, it's pretty well hidden because
> it is NOT in the SASI source code directly.
>
> Here is the method where C* determines what kind of LIKE expression you're
> using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
>
> https://github.com/apache/cassandra/blob/trunk/src/java/
> org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#
> L733-L778
>
> As you can see, it's pretty simple, maybe too simple. Indeed, they forget
> to remove escape character BEFORE doing the matching so if your search is LIKE
> '%%esc%', the detected expression is LIKE_CONTAINS.
>
> A possible fix would be:
>
> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
> the column data type)
> 2) remove the escape character e.g. before parsing OR use some advanced
> regex to exclude the %% from parsing e.g
>
> Step 2) is dead easy but step 1) is harder because I don't know if
> converting the bytebuffer into String at this stage of the CQL parser is
> expensive or not (in term of computation)
>
> Let me try a patch
>
>
>
> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan <doanduyhai@gmail.com> wrote:
>
>> Ok you're right, I get your point
>>
>> LIKE '%%esc%' --> startWith('%esc')
>>
>> LIKE 'escape%%' -->  = 'escape%'
>>
>> What I strongly suspect is that in the source code of SASI, we parse the
>> % xxx % expression BEFORE applying escape. That will explain the observed
>> behavior. E.g:
>>
>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>>
>> LIKE 'escape%%' parsed as xxx% where xxx =escape%
>>
>> Let me check in the source code and try to reproduce the issue
>>
>>
>>
>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <
>> mikhail.krupitskiy@jetbrains.com> wrote:
>>
>>> Looks like we have different understanding of what results are expected.
>>> I based my understanding on http://docs.datastax.com/en
>>> /cql/3.3/cql/cql_using/useSASIIndex.html
>>> According to the doc ‘esc’ is a pattern for exact match and I guess that
>>> there is no semantical difference between two LIKE patterns (both of
>>> patterns should be treated as ‘exact match'): ‘%%esc’ and ‘esc’.
>>>
>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
>>> *containing* '%esc' so *%esc*apeme is a possible match and also escape
>>> *%esc*
>>>
>>> Why ‘containing’? I expect that it should be ’starting’..
>>>
>>>
>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
>>> *starting* with 'escape%' so *escape%*me is a valid result and also
>>> *escape%*esc
>>>
>>> Why ’starting’? I expect that it should be ‘exact matching’.
>>>
>>> Also I expect that “ LIKE ‘%s%sc%’ ” will return ‘escape%esc’ but
it
>>> returns nothing (CASSANDRA-12573).
>>>
>>> What I’m missing?
>>>
>>> Thanks,
>>> Mikhail
>>>
>>> On 13 Sep 2016, at 19:31, DuyHai Doan <doanduyhai@gmail.com> wrote:
>>>
>>> CREATE CUSTOM INDEX ON test.escape(val) USING '
>>> org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode':
>>> 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sa
>>> si.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};
>>>
>>> I don't see any problem in the results you got
>>>
>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
>>> *containing* '%esc' so *%esc*apeme is a possible match and also escape
>>> *%esc*
>>>
>>> Why ‘containing’? I expect that it should be ’starting’..
>>>
>>>
>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
>>> *starting* with 'escape%' so *escape%*me is a valid result and also
>>> *escape%*esc
>>>
>>> Why ’starting’? I expect that it should be ‘exact matching’.
>>>
>>>
>>> On Tue, Sep 13, 2016 at 5:58 PM, Mikhail Krupitskiy <
>>> mikhail.krupitskiy@jetbrains.com> wrote:
>>>
>>>> Thanks for the reply.
>>>> Could you please provide what index definition did you use?
>>>> With the index from my script I get the following results:
>>>>
>>>> cqlsh:test> select * from escape;
>>>>
>>>>  id | val
>>>> ----+-----------
>>>>   1 | %escapeme
>>>>   2 | escape%me
>>>> *  3 | escape%esc*
>>>>
>>>> Contains search
>>>>
>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>>
>>>>  id | val
>>>> ----+-----------
>>>>   1 | %escapeme
>>>>   3
>>>> * | escape%esc*(2 rows)
>>>>
>>>>
>>>> Prefix search
>>>>
>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>>
>>>>  id | val
>>>> ----+-----------
>>>>   2 | escape%me
>>>>   3
>>>> * | escape%esc*
>>>>
>>>> Thanks,
>>>> Mikhail
>>>>
>>>> On 13 Sep 2016, at 18:16, DuyHai Doan <doanduyhai@gmail.com> wrote:
>>>>
>>>> Use % to escape %
>>>>
>>>> cqlsh:test> select * from escape;
>>>>
>>>>  id | val
>>>> ----+-----------
>>>>   1 | %escapeme
>>>>   2 | escape%me
>>>>
>>>>
>>>> Contains search
>>>>
>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>>
>>>>  id | val
>>>> ----+-----------
>>>>   1 | %escapeme
>>>>
>>>> (1 rows)
>>>>
>>>>
>>>> Prefix search
>>>>
>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>>
>>>>  id | val
>>>> ----+-----------
>>>>   2 | escape%me
>>>>
>>>> On Tue, Sep 13, 2016 at 5:06 PM, Mikhail Krupitskiy <
>>>> mikhail.krupitskiy@jetbrains.com> wrote:
>>>>
>>>>> Hi Cassandra guys,
>>>>>
>>>>> I use Cassandra 3.7 and wondering how to use ‘%’ as a simple char
in a
>>>>> search pattern.
>>>>> Here is my test script:
>>>>>
>>>>> DROP keyspace if exists kmv;
>>>>> CREATE keyspace if not exists kmv WITH REPLICATION = { 'class' :
>>>>> 'SimpleStrategy', 'replication_factor':'1'} ;
>>>>> USE kmv;
>>>>> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY
>>>>> KEY(id, c1));
>>>>> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING '
>>>>> org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>>>>> 'analyzed' : 'true',
>>>>> 'analyzer_class' : 'org.apache.cassandra.index.sa
>>>>> si.analyzer.NonTokenizingAnalyzer',
>>>>> 'case_sensitive' : 'false',
>>>>> 'mode' : 'CONTAINS'
>>>>> };
>>>>>
>>>>> INSERT into kmv (id, c1, c2) values (1, 'f22', 'qwe%asd');
>>>>> INSERT into kmv (id, c1, c2) values (2, 'f22', '%asd');
>>>>> INSERT into kmv (id, c1, c2) values (3, 'f22', 'asd%');
>>>>> INSERT into kmv (id, c1, c2) values (4, 'f22', 'asd%1');
>>>>> INSERT into kmv (id, c1, c2) values (5, 'f22', 'qweasd');
>>>>>
>>>>> SELECT c2 from kmv.kmv where c2 like ‘_pattern_';
>>>>>
>>>>> _pattern_ '%%%' finds all columns that contain %.
>>>>> How to find columns that start form ‘%’ or ‘%a’?
>>>>> How to find columns that end with ‘%’?
>>>>> What about more complex patterns: '%qwe%a%sd%’? How to differentiate
>>>>> ‘%’ char form % as a command symbol? (Also there is a related issue
>>>>> CASSANDRA-12573).
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Mikhail
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

Mime
View raw message