cassandra-commits mailing list archives

From "DOAN DuyHai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
Date Thu, 23 Jun 2016 12:20:16 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DOAN DuyHai updated CASSANDRA-12078:
------------------------------------
    Description: 
Right now, if both skip_stop_words and stemming are enabled, SASI puts stemming in the filter
pipeline BEFORE skip_stop_words:

{code:java}

    private FilterPipelineTask getFilterPipeline()
    {
        FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
     ...
        if (options.shouldStemTerms())
            builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
        if (options.shouldIgnoreStopTerms())
            builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
        return builder.build();
    }
{code}

The problem is that stemming before removing stop words can yield wrong results.

I have an example:

{code:sql}
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
{code}

Because of stemming, *danse* (*dance* in English) becomes *dans* (the final vowel is removed).
Then skip_stop_words is applied. Unfortunately, *dans* (*in* in English) is a stop word in
French, so the term is removed completely.

In the end, the query is equivalent to {{SELECT * FROM music.albums WHERE country='France'}},
which ignores the title predicate, so the results are wrong.

Attached is a trivial patch that moves the skip_stop_words filter BEFORE the stemming filter.
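
To make the ordering issue concrete, here is a minimal, self-contained sketch. It does not use SASI's actual classes: {{FilterOrderDemo}}, the one-rule stemmer, and the four-word stop list are hypothetical stand-ins chosen only to reproduce the *danse* / *dans* example above. The real filters are locale-aware and far more elaborate.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of a two-stage filter pipeline. All names here are hypothetical
// and are NOT part of SASI; they only illustrate why filter order matters.
class FilterOrderDemo {
    // Assumption: a tiny French stop-word list that contains "dans".
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("dans", "le", "la", "de"));

    // Assumption: a toy stemmer that strips a single trailing 'e'.
    static String stem(String term) {
        if (term == null) return null;
        return term.endsWith("e") ? term.substring(0, term.length() - 1) : term;
    }

    // Returns null when the term is a stop word, i.e. the term is dropped.
    static String skipStopWords(String term) {
        if (term == null) return null;
        return STOP_WORDS.contains(term.toLowerCase(Locale.FRENCH)) ? null : term;
    }

    // Order described in this ticket: stemming first, then stop words.
    static String stemThenStop(String term) {
        return skipStopWords(stem(term));
    }

    // Proposed order: stop words first, then stemming.
    static String stopThenStem(String term) {
        return stem(skipStopWords(term));
    }

    public static void main(String[] args) {
        System.out.println(stemThenStop("danse")); // null: "danse" -> "dans" -> dropped
        System.out.println(stopThenStem("danse")); // dans: kept, then stemmed
        System.out.println(stopThenStem("dans"));  // null: a genuine stop word is still dropped
    }
}
```

With the stop-word check first, only terms the user actually typed are tested against the stop list, so a stem that happens to collide with a stop word is no longer discarded.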

/cc [~xedin] [~jrwest] [~beobal]

  was:
Right now, if skip stop words and stemming are enabled, SASI will put stemming in the filter
pipeline BEFORE skip_stop_words:

{code:java}

    private FilterPipelineTask getFilterPipeline()
    {
        FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
     ...
        if (options.shouldStemTerms())
            builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
        if (options.shouldIgnoreStopTerms())
            builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
        return builder.build();
    }
{code}

The problem is that stemming before removing stop words can yield wrong results.

I have an example:

{code:sql}
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
{code}

*danse* = *dance* in English, and because of stemming, it becomes *dans* (the final vowel
is removed). Then skip stop words is applied. Unfortunately *dans* = *in* in English, a stop
word in French so it is removed completely.

In the end the query is equivalent to {{SELECT * FROM music.albums WHERE country='France'}}
and of course the results are wrong.

Attached is a trivial patch to move the skip_stop_words filter BEFORE stemming filter


> [SASI] Move skip_stop_words filter BEFORE stemming
> --------------------------------------------------
>
>                 Key: CASSANDRA-12078
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>         Environment: Cassandra 3.7, Cassandra 3.8
>            Reporter: DOAN DuyHai
>            Assignee: DOAN DuyHai
>         Attachments: patch.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
