cassandra-commits mailing list archives

From "DOAN DuyHai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-11122) SASI does not find term when indexing non-ascii character
Date Fri, 05 Feb 2016 13:05:39 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-11122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DOAN DuyHai updated CASSANDRA-11122:
------------------------------------
    Description: 
I built the snapshot version taken from here: https://github.com/xedin/cassandra/tree/CASSANDRA-11067

I created a tiny music dataset containing non-ASCII characters (*Cyrillic*, actually) and created
a SASI index on the artist name.

SASI finds the row with the Cyrillic artist name but, strangely, fails to find a plain ASCII name (_'Object'_).

{code:sql}
CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

CREATE TABLE music.albums (
    title text PRIMARY KEY,
    artist text
);

INSERT INTO music.albums(artist,title) VALUES('Object','The Reflecting Skin');
INSERT INTO music.albums(artist,title) VALUES('Hayden','Mild and Hazy');
INSERT INTO music.albums(artist,title) VALUES('Самое Большое Простое Число','СБПЧ Оркестр');

CREATE custom INDEX on music.albums(artist) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'};

SELECT * FROM music.albums;


title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
       Mild and Hazy |                      Hayden
        СБПЧ Оркестр | Самое Большое Простое Число

(3 rows)

SELECT * FROM music.albums WHERE artist='Самое Большое Простое Число';

title               | artist
---------------------+-----------------------------
        СБПЧ Оркестр | Самое Большое Простое Число

(1 rows)

SELECT * FROM music.albums WHERE artist='Hayden';

title               | artist
---------------------+-----------------------------
       Mild and Hazy |                      Hayden


(1 rows)

SELECT * FROM music.albums WHERE artist='Object';

title               | artist
---------------------+-----------------------------

(0 rows)

SELECT * FROM music.albums WHERE artist like 'Ob%';

title               | artist
---------------------+-----------------------------

(0 rows)
{code}

Strangely enough, after clearing all the data and re-inserting without the Russian artist
with the Cyrillic name, SASI does find _'Object'_ ...

{code:sql}
DROP INDEX albums_artist_idx;
TRUNCATE TABLE albums;

INSERT INTO albums(artist,title) VALUES('Object','The Reflecting Skin');
INSERT INTO albums(artist,title) VALUES('Hayden','Mild and Hazy');


CREATE custom INDEX on music.albums(artist) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'};

SELECT * FROM music.albums;


title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
       Mild and Hazy |                      Hayden

(2 rows)

SELECT * FROM music.albums WHERE artist='Object';

title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object

(1 rows)

SELECT * FROM music.albums WHERE artist LIKE 'Ob%';

title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object

(1 rows)

{code}

The behaviour is quite inconsistent. I could understand SASI refusing to index Cyrillic
characters, or throwing an exception when it encounters non-ASCII characters (because we did not specify
a locale), but it is very surprising that indexing fails for plain ASCII terms like
_Object_.

Could it be that SASI indexes the artist names by following the token range order of table albums
(hash of title) and stops indexing after encountering the Cyrillic name?
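
One way to check this hypothesis would be to list each row together with the token of its partition key, since an unfiltered scan returns partitions in token order. This is only a sketch that was not run as part of this report, so no output is shown; it assumes the default partitioner and uses the standard CQL token() function.

{code:sql}
-- Rows are returned in token order of the partition key (title).
-- Compare the token of 'The Reflecting Skin' with that of 'СБПЧ Оркестр'.
SELECT token(title), title, artist FROM music.albums;
{code}

If the row that cannot be found sorts after the Cyrillic row, that would be consistent with indexing stopping there; if it sorts before, another explanation is needed.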



  was:
I built the snapshot version taken from here: https://github.com/xedin/cassandra/tree/CASSANDRA-11067

I created a tiny music dataset containing non-ASCII characters (*Cyrillic*, actually) and created
a SASI index on the artist name.

SASI finds the row with the Cyrillic artist name but, strangely, fails to find a plain ASCII name (_'Object'_).

{code:sql}
CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

CREATE TABLE music.albums (
    title text PRIMARY KEY,
    artist text
);

INSERT INTO music.albums(artist,title) VALUES('Object','The Reflecting Skin');
INSERT INTO music.albums(artist,title) VALUES('Hayden','Mild and Hazy');
INSERT INTO music.albums(artist,title) VALUES('Самое Большое Простое Число','СБПЧ Оркестр');

CREATE custom INDEX on music.albums(artist) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'};

SELECT * FROM music.albums;


title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
       Mild and Hazy |                      Hayden
        СБПЧ Оркестр | Самое Большое Простое Число

(3 rows)

SELECT * FROM albums WHERE artist='Самое Большое Простое Число';

title               | artist
---------------------+-----------------------------
        СБПЧ Оркестр | Самое Большое Простое Число

(1 rows)

SELECT * FROM albums WHERE artist='Hayden';

title               | artist
---------------------+-----------------------------
       Mild and Hazy |                      Hayden


(1 rows)

SELECT * FROM albums WHERE artist='Object';

title               | artist
---------------------+-----------------------------

(0 rows)

SELECT * FROM albums WHERE artist like 'Ob%';

title               | artist
---------------------+-----------------------------

(0 rows)
{code}

Strangely enough, after clearing all the data and re-inserting without the Russian artist
with the Cyrillic name, SASI does find _'Object'_ ...

{code:sql}
DROP INDEX albums_artist_idx;
TRUNCATE TABLE albums;

INSERT INTO albums(artist,title) VALUES('Object','The Reflecting Skin');
INSERT INTO albums(artist,title) VALUES('Hayden','Mild and Hazy');

SELECT * FROM music.albums;


title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
       Mild and Hazy |                      Hayden

(2 rows)

SELECT * FROM albums WHERE artist='Object';

title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object

(1 rows)

SELECT * FROM albums WHERE artist LIKE 'Ob%';

title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object

(1 rows)

{code}

The behaviour is quite inconsistent. I could understand SASI refusing to index Cyrillic
characters, or throwing an exception when it encounters non-ASCII characters (because we did not specify
a locale), but it is very surprising that indexing fails for plain ASCII terms like
_Object_.

Could it be that SASI indexes the artist names by following the token range order of table albums
(hash of title) and stops indexing after encountering the Cyrillic name?




> SASI does not find term when indexing non-ascii character
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-11122
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11122
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: Cassandra 3.4 SNAPSHOT
>            Reporter: DOAN DuyHai
>         Attachments: CASSANDRA-11122.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
