cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikhail Krupitskiy <mikhail.krupits...@jetbrains.com>
Subject Re: How to query '%' character using LIKE operator in Cassandra 3.7?
Date Tue, 04 Oct 2016 19:54:38 GMT
Please see my comments inline.

Thanks,
Mikhail
> On 26 Sep 2016, at 17:07, DuyHai Doan <doanduyhai@gmail.com> wrote:
> 
> "In the current implementation (‘%’ could be a wildcard only at the start/end of
a term) I guess it should be ’ENDS with ‘%escape’ ‘." 
> 
> --> Yes in the current impl, it means ENDS WITH '%escape' but we want SASI to understand
the %% as an escape for % so the goal is that SASI understands LIKE '%%escape' as EQUALS TO
'%escape'. Am I correct ?
I guess that the goal is to define a way to use ‘%’ as a simple char.
LIKE '%escape' - ENDS WITH 'escape'
LIKE '%%escape' - EQUALS TO '%escape’
LIKE '%%escape%' - STARTS WITH '%escape’

LIKE ‘%%%escape’ - undefined in general case
LIKE ‘%%%escape’ - ENDS WITH “%escape” in a case when we know that a wildcard could
be only at the start/end.
> 
> "Moreover all terms that contains single ‘%’ somewhere in the middle should cause
an exception."
> 
> --> Not necessarily, sometime people may want to search text pattern containing the
literal %. Imagine the text "this year the average income has increase by 10%". People may
want to search for "10%”.
If someone wants to search for ’10%’ then he should escape the ‘%’ char: like “10%%”.
> 
> 
> 
> "BUT may be it’s better to make escaping more universal to support a future possible
case where a wildcard could be placed in the middle of a term too?"
> 
> --> I guess universal escaping for % is the cleaner and better solution. However it
may involve some complex regular expression. I'm not sure that input.replaceAll("%%", "%")
trick would work for any cases.
As I wrote I don’t think that it’s possible to do universal escaping using ‘%’ as
an escape char (a char to escape wildcard char to make it a simple char semantically) and
as wildcard at the same time.
I suggest to use “\” as an escape char.
Also I don’t know enough about Cassandra’s internals to estimate how universal escaping
will affect performance.
It really looks like a better solution for Cassandra users.
> 
> And we also need to define when we want to detect operation type (LIKE_PREFIX, LIKE_SUFFIX,
LIKE_CONTAINS, EQUAL) ? 
> 
> Should we detect operation type BEFORE escaping or AFTER escaping ?
As I understand ‘escaping' will be done by users. 
So on DB level we get an already escaped string from a request and it’s possible to know
which symbol is a wildcard and which is just a char.
I guess that Cassandra should parse (unescape?) an incoming string to define wildcards positions
and simple chars positions and then define an operation type.

 
> 
> 
> 
> 
> 
> On Mon, Sep 26, 2016 at 3:54 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>> LIKE '%%%escape' --> EQUALS TO '%%escape' ???
> In the current implementation (‘%’ could be a wildcard only at the start/end of a
term) I guess it should be ’ENDS with ‘%escape’ ‘.
> Moreover all terms that contains single ‘%’ somewhere in the middle should cause
an exception.
> BUT may be it’s better to make escaping more universal to support a future possible
case where a wildcard could be placed in the middle of a term too?
> 
> Thanks,
> Mikhail 
>> On 24 Sep 2016, at 21:09, DuyHai Doan <doanduyhai@gmail.com <mailto:doanduyhai@gmail.com>>
wrote:
>> 
>> Reminder, right now, the % character is only interpreted as wildcard IF AND ONLY
IF it is the first/last character of the searched term
>> 
>> 
>> LIKE '%escape' --> ENDS WITH 'escape' 
>> 
>> If we use % to escape %,
>> LIKE '%%escape' -->  EQUALS TO '%escape'
>> 
>> LIKE '%%%escape' --> EQUALS TO '%%escape' ???
>> 
>> 
>> 
>> 
>> On Fri, Sep 23, 2016 at 5:02 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>> Hi, Jim,
>> 
>> What pattern should be used to search “ends with  ‘%escape’ “ with your conception?
>> 
>> Thanks,
>> Mikhail
>> 
>>> On 22 Sep 2016, at 18:51, Jim Ancona <jim@anconafamily.com <mailto:jim@anconafamily.com>>
wrote:
>>> 
>>> To answer DuyHai's question without introducing new syntax, I'd suggest:
>>>> LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape' 
>>> So the first two %'s are translated to a literal, non-wildcard % and the third
% is a wildcard because it's not doubled.
>>> 
>>> Jim
>>> 
>>> On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>>> I guess that it should be similar to how it is done in SQL for LIKE patterns.
>>> 
>>> You can introduce an escape character, e.g. ‘\’.
>>> Examples:
>>> ‘%’ - any string
>>> ‘\%’ - equal to ‘%’ character
>>> ‘\%foo%’ - starts from ‘%foo’
>>> ‘%%%escape’ - ends with ’escape’
>>> ‘\%%’ - starts from ‘%’
>>> ‘\\\%%’ - starts from ‘\%’ .
>>> 
>>> What do you think?
>>> 
>>> Thanks,
>>> Mikhail
>>>> On 22 Sep 2016, at 16:47, DuyHai Doan <doanduyhai@gmail.com <mailto:doanduyhai@gmail.com>>
wrote:
>>>> 
>>>> Hello Mikhail
>>>> 
>>>> It's more complicated that it seems
>>>> 
>>>> LIKE '%%escape' means  EQUAL TO '%escape'
>>>> 
>>>> LIKE '%escape' means ENDS WITH 'escape'
>>>> 
>>>> What's about LIKE '%%%escape' ????
>>>> 
>>>> How should we treat this case ? Replace %% by % at the beginning of the searched
term ??
>>>> 
>>>> 
>>>> 
>>>> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>>>> Hi!
>>>> 
>>>> We’ve talked about two items:
>>>> 1) ‘%’ as a wildcard in the middle of LIKE pattern.
>>>> 2) How to escape ‘%’ to be able to find strings with the ‘%’ char
with help of LIKE.
>>>> 
>>>> Item #1was resolved as CASSANDRA-12573.
>>>> 
>>>> Regarding to item #2: you said the following:
>>>>> A possible fix would be:
>>>>> 
>>>>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending
on the column data type)
>>>>> 2) remove the escape character e.g. before parsing OR use some advanced
regex to exclude the %% from parsing e.g
>>>>> 
>>>>> Step 2) is dead easy but step 1) is harder because I don't know if converting
the bytebuffer into String at this stage of the CQL parser is expensive or not (in term of
computation)
>>>>> 
>>>>> Let me try a patch 
>>>> 
>>>> So is there any update on this?
>>>> 
>>>> Thanks,
>>>> Mikhail
>>>> 
>>>> 
>>>>> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>>>>> 
>>>>> Hi!
>>>>> 
>>>>> Have you had a chance to try your patch or solve the issue in an other
way? 
>>>>> 
>>>>> Thanks,
>>>>> Mikhail
>>>>>> On 15 Sep 2016, at 16:02, DuyHai Doan <doanduyhai@gmail.com <mailto:doanduyhai@gmail.com>>
wrote:
>>>>>> 
>>>>>> Ok so I've found the source of the issue, it's pretty well hidden
because it is NOT in the SASI source code directly.
>>>>>> 
>>>>>> Here is the method where C* determines what kind of LIKE expression
you're using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
>>>>>> 
>>>>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
<https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778>
>>>>>> 
>>>>>> As you can see, it's pretty simple, maybe too simple. Indeed, they
forget to remove escape character BEFORE doing the matching so if your search is LIKE '%%esc%',
the detected expression is LIKE_CONTAINS.
>>>>>> 
>>>>>> A possible fix would be:
>>>>>> 
>>>>>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending
on the column data type)
>>>>>> 2) remove the escape character e.g. before parsing OR use some advanced
regex to exclude the %% from parsing e.g
>>>>>> 
>>>>>> Step 2) is dead easy but step 1) is harder because I don't know if
converting the bytebuffer into String at this stage of the CQL parser is expensive or not
(in term of computation)
>>>>>> 
>>>>>> Let me try a patch  
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan <doanduyhai@gmail.com
<mailto:doanduyhai@gmail.com>> wrote:
>>>>>> Ok you're right, I get your point
>>>>>> 
>>>>>> LIKE '%%esc%' --> startWith('%esc')
>>>>>> 
>>>>>> LIKE 'escape%%' -->  = 'escape%'
>>>>>> 
>>>>>> What I strongly suspect is that in the source code of SASI, we parse
the % xxx % expression BEFORE applying escape. That will explain the observed behavior. E.g:
>>>>>> 
>>>>>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>>>>>> 
>>>>>> LIKE 'escape%%' parsed as xxx% where xxx =escape%
>>>>>> 
>>>>>> Let me check in the source code and try to reproduce the issue
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>>>>>> Looks like we have different understanding of what results are expected.
>>>>>> I based my understanding on http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html
<http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html>
>>>>>> According to the doc ‘esc’ is a pattern for exact match and I
guess that there is no semantical difference between two LIKE patterns (both of patterns should
be treated as ‘exact match'): ‘%%esc’ and ‘esc’.
>>>>>> 
>>>>>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all
results containing '%esc' so %escapeme is a possible match and also escape%esc
>>>>>> Why ‘containing’? I expect that it should be ’starting’..
>>>>>>> 
>>>>>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all
results starting with 'escape%' so escape%me is a valid result and also escape%esc
>>>>>> Why ’starting’? I expect that it should be ‘exact matching’.
>>>>>> 
>>>>>> Also I expect that “ LIKE ‘%s%sc%’ ” will return ‘escape%esc’
but it returns nothing (CASSANDRA-12573).
>>>>>> 
>>>>>> What I’m missing?
>>>>>> 
>>>>>> Thanks,
>>>>>> Mikhail
>>>>>> 
>>>>>>> On 13 Sep 2016, at 19:31, DuyHai Doan <doanduyhai@gmail.com
<mailto:doanduyhai@gmail.com>> wrote:
>>>>>>> 
>>>>>>> CREATE CUSTOM INDEX ON test.escape(val) USING 'org.apache.cassandra.index.sa
<http://org.apache.cassandra.index.sa/>si.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS',
'analyzer_class': 'org.apache.cassandra.index.sa <http://org.apache.cassandra.index.sa/>si.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'};
>>>>>>> 
>>>>>>> I don't see any problem in the results you got
>>>>>>> 
>>>>>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all
results containing '%esc' so %escapeme is a possible match and also escape%esc
>>>>>> Why ‘containing’? I expect that it should be ’starting’..
>>>>>>> 
>>>>>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all
results starting with 'escape%' so escape%me is a valid result and also escape%esc
>>>>>> Why ’starting’? I expect that it should be ‘exact matching’.
>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Sep 13, 2016 at 5:58 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>>>>>>> Thanks for the reply.
>>>>>>> Could you please provide what index definition did you use?
>>>>>>> With the index from my script I get the following results:
>>>>>>> 
>>>>>>> cqlsh:test> select * from escape;
>>>>>>> 
>>>>>>>  id | val
>>>>>>> ----+-----------
>>>>>>>   1 | %escapeme
>>>>>>>   2 | escape%me
>>>>>>>   3 | escape%esc
>>>>>>> 
>>>>>>> Contains search
>>>>>>> 
>>>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>>>>> 
>>>>>>>  id | val
>>>>>>> ----+-----------
>>>>>>>   1 | %escapeme
>>>>>>>   3 | escape%esc
>>>>>>> (2 rows)
>>>>>>> 
>>>>>>> 
>>>>>>> Prefix search
>>>>>>> 
>>>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>>>>> 
>>>>>>>  id | val
>>>>>>> ----+-----------
>>>>>>>   2 | escape%me
>>>>>>>   3 | escape%esc
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Mikhail 
>>>>>>> 
>>>>>>>> On 13 Sep 2016, at 18:16, DuyHai Doan <doanduyhai@gmail.com
<mailto:doanduyhai@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Use % to escape %
>>>>>>>> 
>>>>>>>> cqlsh:test> select * from escape;
>>>>>>>> 
>>>>>>>>  id | val
>>>>>>>> ----+-----------
>>>>>>>>   1 | %escapeme
>>>>>>>>   2 | escape%me
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Contains search
>>>>>>>> 
>>>>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>>>>>> 
>>>>>>>>  id | val
>>>>>>>> ----+-----------
>>>>>>>>   1 | %escapeme
>>>>>>>> 
>>>>>>>> (1 rows)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Prefix search
>>>>>>>> 
>>>>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>>>>>> 
>>>>>>>>  id | val
>>>>>>>> ----+-----------
>>>>>>>>   2 | escape%me
>>>>>>>> 
>>>>>>>> On Tue, Sep 13, 2016 at 5:06 PM, Mikhail Krupitskiy <mikhail.krupitskiy@jetbrains.com
<mailto:mikhail.krupitskiy@jetbrains.com>> wrote:
>>>>>>>> Hi Cassandra guys,
>>>>>>>> 
>>>>>>>> I use Cassandra 3.7 and wondering how to use ‘%’ as a
simple char in a search pattern.
>>>>>>>> Here is my test script:
>>>>>>>> 
>>>>>>>> DROP keyspace if exists kmv;
>>>>>>>> CREATE keyspace if not exists kmv WITH REPLICATION = { 'class'
: 'SimpleStrategy', 'replication_factor':'1'} ;
>>>>>>>> USE kmv;
>>>>>>>> CREATE TABLE if not exists kmv (id int, c1 text, c2 text,
PRIMARY KEY(id, c1));
>>>>>>>> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 'org.apache.cassandra.index.sa
<http://org.apache.cassandra.index.sa/>si.SASIIndex' WITH OPTIONS = {
>>>>>>>> 'analyzed' : 'true',
>>>>>>>> 'analyzer_class' : 'org.apache.cassandra.index.sa <http://org.apache.cassandra.index.sa/>si.analyzer.NonTokenizingAnalyzer',
>>>>>>>> 'case_sensitive' : 'false',
>>>>>>>> 'mode' : 'CONTAINS'
>>>>>>>> };
>>>>>>>> 
>>>>>>>> INSERT into kmv (id, c1, c2) values (1, 'f22', 'qwe%asd');
>>>>>>>> INSERT into kmv (id, c1, c2) values (2, 'f22', '%asd');
>>>>>>>> INSERT into kmv (id, c1, c2) values (3, 'f22', 'asd%');
>>>>>>>> INSERT into kmv (id, c1, c2) values (4, 'f22', 'asd%1');
>>>>>>>> INSERT into kmv (id, c1, c2) values (5, 'f22', 'qweasd');
>>>>>>>> 
>>>>>>>> SELECT c2 from kmv.kmv where c2 like ‘_pattern_';
>>>>>>>> 
>>>>>>>> _pattern_ '%%%' finds all columns that contain %.
>>>>>>>> How to find columns that start form ‘%’ or ‘%a’?
>>>>>>>> How to find columns that end with ‘%’?
>>>>>>>> What about more complex patterns: '%qwe%a%sd%’? How to
differentiate ‘%’ char form % as a command symbol? (Also there is a related issue CASSANDRA-12573).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Mikhail
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


Mime
View raw message