cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikhail Krupitskiy <mikhail.krupits...@jetbrains.com>
Subject Re: How to query '%' character using LIKE operator in Cassandra 3.7?
Date Thu, 22 Sep 2016 13:31:11 GMT
Hi!

We’ve talked about two items:
1) ‘%’ as a wildcard in the middle of LIKE pattern.
2) How to escape ‘%’ to be able to find strings with the ‘%’ char with help of LIKE.

Item #1was resolved as CASSANDRA-12573.

Regarding to item #2: you said the following:
> A possible fix would be:
> 
> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on the column data
type)
> 2) remove the escape character e.g. before parsing OR use some advanced regex to exclude
the %% from parsing e.g
> 
> Step 2) is dead easy but step 1) is harder because I don't know if converting the bytebuffer
into String at this stage of the CQL parser is expensive or not (in term of computation)
> 
> Let me try a patch 

So is there any update on this?

Thanks,
Mikhail


> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com>
wrote:
> 
> Hi!
> 
> Have you had a chance to try your patch or solve the issue in an other way? 
> 
> Thanks,
> Mikhail
>> On 15 Sep 2016, at 16:02, DuyHai Doan <doanduyhai@gmail.com <mailto:doanduyhai@gmail.com>>
wrote:
>> 
>> Ok so I've found the source of the issue, it's pretty well hidden because it is NOT
in the SASI source code directly.
>> 
>> Here is the method where C* determines what kind of LIKE expression you're using
(LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
>> 
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
<https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778>
>> 
>> As you can see, it's pretty simple, maybe too simple. Indeed, they forget to remove
escape character BEFORE doing the matching so if your search is LIKE '%%esc%', the detected
expression is LIKE_CONTAINS.
>> 
>> A possible fix would be:
>> 
>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on the column
data type)
>> 2) remove the escape character e.g. before parsing OR use some advanced regex to
exclude the %% from parsing e.g
>> 
>> Step 2) is dead easy but step 1) is harder because I don't know if converting the
bytebuffer into String at this stage of the CQL parser is expensive or not (in term of computation)
>> 
>> Let me try a patch  
>> 
>> 
>> 
>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan <doanduyhai@gmail.com <mailto:doanduyhai@gmail.com>>
wrote:
>> Ok you're right, I get your point
>> 
>> LIKE '%%esc%' --> startWith('%esc')
>> 
>> LIKE 'escape%%' -->  = 'escape%'
>> 
>> What I strongly suspect is that in the source code of SASI, we parse the % xxx %
expression BEFORE applying escape. That will explain the observed behavior. E.g:
>> 
>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>> 
>> LIKE 'escape%%' parsed as xxx% where xxx =escape%
>> 
>> Let me check in the source code and try to reproduce the issue
>> 
>> 
>> 
>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>> Looks like we have different understanding of what results are expected.
>> I based my understanding on http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html
<http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html>
>> According to the doc ‘esc’ is a pattern for exact match and I guess that there
is no semantical difference between two LIKE patterns (both of patterns should be treated
as ‘exact match'): ‘%%esc’ and ‘esc’.
>> 
>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results containing
'%esc' so %escapeme is a possible match and also escape%esc
>> Why ‘containing’? I expect that it should be ’starting’..
>>> 
>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results starting
with 'escape%' so escape%me is a valid result and also escape%esc
>> Why ’starting’? I expect that it should be ‘exact matching’.
>> 
>> Also I expect that “ LIKE ‘%s%sc%’ ” will return ‘escape%esc’ but it
returns nothing (CASSANDRA-12573).
>> 
>> What I’m missing?
>> 
>> Thanks,
>> Mikhail
>> 
>>> On 13 Sep 2016, at 19:31, DuyHai Doan <doanduyhai@gmail.com <mailto:doanduyhai@gmail.com>>
wrote:
>>> 
>>> CREATE CUSTOM INDEX ON test.escape(val) USING 'org.apache.cassandra.index.sa
<http://org.apache.cassandra.index.sa/>si.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS',
'analyzer_class': 'org.apache.cassandra.index.sa <http://org.apache.cassandra.index.sa/>si.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'};
>>> 
>>> I don't see any problem in the results you got
>>> 
>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results containing
'%esc' so %escapeme is a possible match and also escape%esc
>> Why ‘containing’? I expect that it should be ’starting’..
>>> 
>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results starting
with 'escape%' so escape%me is a valid result and also escape%esc
>> Why ’starting’? I expect that it should be ‘exact matching’.
>> 
>>> 
>>> On Tue, Sep 13, 2016 at 5:58 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>>> Thanks for the reply.
>>> Could you please provide what index definition did you use?
>>> With the index from my script I get the following results:
>>> 
>>> cqlsh:test> select * from escape;
>>> 
>>>  id | val
>>> ----+-----------
>>>   1 | %escapeme
>>>   2 | escape%me
>>>   3 | escape%esc
>>> 
>>> Contains search
>>> 
>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>> 
>>>  id | val
>>> ----+-----------
>>>   1 | %escapeme
>>>   3 | escape%esc
>>> (2 rows)
>>> 
>>> 
>>> Prefix search
>>> 
>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>> 
>>>  id | val
>>> ----+-----------
>>>   2 | escape%me
>>>   3 | escape%esc
>>> 
>>> Thanks,
>>> Mikhail 
>>> 
>>>> On 13 Sep 2016, at 18:16, DuyHai Doan <doanduyhai@gmail.com <mailto:doanduyhai@gmail.com>>
wrote:
>>>> 
>>>> Use % to escape %
>>>> 
>>>> cqlsh:test> select * from escape;
>>>> 
>>>>  id | val
>>>> ----+-----------
>>>>   1 | %escapeme
>>>>   2 | escape%me
>>>> 
>>>> 
>>>> Contains search
>>>> 
>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>> 
>>>>  id | val
>>>> ----+-----------
>>>>   1 | %escapeme
>>>> 
>>>> (1 rows)
>>>> 
>>>> 
>>>> Prefix search
>>>> 
>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>> 
>>>>  id | val
>>>> ----+-----------
>>>>   2 | escape%me
>>>> 
>>>> On Tue, Sep 13, 2016 at 5:06 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>>>> Hi Cassandra guys,
>>>> 
>>>> I use Cassandra 3.7 and wondering how to use ‘%’ as a simple char in
a search pattern.
>>>> Here is my test script:
>>>> 
>>>> DROP keyspace if exists kmv;
>>>> CREATE keyspace if not exists kmv WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor':'1'} ;
>>>> USE kmv;
>>>> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id,
c1));
>>>> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 'org.apache.cassandra.index.sa
<http://org.apache.cassandra.index.sa/>si.SASIIndex' WITH OPTIONS = {
>>>> 'analyzed' : 'true',
>>>> 'analyzer_class' : 'org.apache.cassandra.index.sa <http://org.apache.cassandra.index.sa/>si.analyzer.NonTokenizingAnalyzer',
>>>> 'case_sensitive' : 'false',
>>>> 'mode' : 'CONTAINS'
>>>> };
>>>> 
>>>> INSERT into kmv (id, c1, c2) values (1, 'f22', 'qwe%asd');
>>>> INSERT into kmv (id, c1, c2) values (2, 'f22', '%asd');
>>>> INSERT into kmv (id, c1, c2) values (3, 'f22', 'asd%');
>>>> INSERT into kmv (id, c1, c2) values (4, 'f22', 'asd%1');
>>>> INSERT into kmv (id, c1, c2) values (5, 'f22', 'qweasd');
>>>> 
>>>> SELECT c2 from kmv.kmv where c2 like ‘_pattern_';
>>>> 
>>>> _pattern_ '%%%' finds all columns that contain %.
>>>> How to find columns that start form ‘%’ or ‘%a’?
>>>> How to find columns that end with ‘%’?
>>>> What about more complex patterns: '%qwe%a%sd%’? How to differentiate ‘%’
char form % as a command symbol? (Also there is a related issue CASSANDRA-12573).
>>>> 
>>>> 
>>>> Thanks,
>>>> Mikhail
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 


Mime
View raw message