cassandra-commits mailing list archives

From "DOAN DuyHai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
Date Sun, 26 Jun 2016 07:54:33 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350028#comment-15350028 ]

DOAN DuyHai commented on CASSANDRA-12078:
-----------------------------------------

[~xedin]

 I was able to reproduce the unit test failure locally. The failing test is {{testTokenizationAdventuresOfHuckFinn}}.
With the skip stop words filter moved before stemming, the expected token count becomes *37739*
instead of *40249*.

 Moving skip stop words before stemming also triggers a {{NullPointerException}}.
In some cases the token is removed by the stop words filter, so the input to the stemming
filter is null. I've added an extra null check in {{DefaultStemmingFilter}}:

{code:java}
        public String process(String input) throws Exception
        {
            if (input == null || stemmer == null)
                return input;
            stemmer.setCurrent(input);
            return (stemmer.stem()) ? stemmer.getCurrent() : input;
        }
{code}
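
 To illustrate why the null check matters, here is a minimal sketch, not the actual SASI code. It assumes a removed token is signalled by returning null, as described above. The class {{NullSafeStemmingSketch}}, the tiny stop word set and the toy stemmer (which only strips a trailing "e") are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for the SASI filters; these are not the real Cassandra classes.
class NullSafeStemmingSketch
{
    // Tiny illustrative stop word set (assumption), not the real French list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("dans", "la", "le"));

    // Mimics a stop word filter: a removed token is signalled by returning null.
    static String skipStopWords(String input)
    {
        return (input == null || STOP_WORDS.contains(input)) ? null : input;
    }

    // Toy stemmer (assumption): strips one trailing 'e'.
    // The null guard plays the same role as the check added to the stemming filter.
    static String stem(String input)
    {
        if (input == null)
            return input;
        return input.endsWith("e") ? input.substring(0, input.length() - 1) : input;
    }

    // Stop words first, then stemming: the stemmer may now receive null.
    static String process(String token)
    {
        return stem(skipStopWords(token));
    }

    public static void main(String[] args)
    {
        System.out.println(process("dans"));  // prints "null": token dropped, no exception
        System.out.println(process("danse")); // prints "dans"
    }
}
```

 Without the guard in {{stem}}, the call to {{endsWith}} on a dropped token would throw a {{NullPointerException}} in this sketch.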

 I have also added a new unit test in {{StandardAnalyzerTest}} to cover the French issue described
in the ticket:

{code:java}
    @Test
    public void testSkipStopWordBeforeStemmingFrench() throws Exception
    {
        InputStream is = StandardAnalyzerTest.class.getClassLoader()
               .getResourceAsStream("tokenization/french_skip_stop_words_before_stemming.txt");

        StandardTokenizerOptions options = new StandardTokenizerOptions.OptionsBuilder().stemTerms(true)
                .ignoreStopTerms(true).useLocale(Locale.FRENCH)
                .alwaysLowerCaseTerms(true).build();
        StandardAnalyzer tokenizer = new StandardAnalyzer();
        tokenizer.init(options);

        List<ByteBuffer> tokens = new ArrayList<>();
        List<String> words = new ArrayList<>();
        tokenizer.reset(is);
        while (tokenizer.hasNext())
        {
            final ByteBuffer nextToken = tokenizer.next();
            tokens.add(nextToken);
            words.add(UTF8Serializer.instance.deserialize(nextToken.duplicate()));
        }

        assertEquals(4, tokens.size());
        assertEquals("dans", words.get(0));
        assertEquals("plui", words.get(1));
        assertEquals("chanson", words.get(2));
        assertEquals("connu", words.get(3));
    }
{code}
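
 The ordering problem this test guards against can be sketched with a toy model. This is only an illustration under stated assumptions: {{FilterOrderSketch}}, its two-word stop list and its trailing-"e" stemmer are hypothetical and do not reproduce the SASI pipeline or the Snowball stemmer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model of the two filter orders; none of these names come from Cassandra.
class FilterOrderSketch
{
    // Tiny illustrative stop word set (assumption), not the real French list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("dans", "la"));

    // Returns null when the token is a stop word (or already dropped).
    static String skipStopWords(String token)
    {
        return (token == null || STOP_WORDS.contains(token)) ? null : token;
    }

    // Toy stemmer (assumption): strips one trailing 'e'; null passes through.
    static String stem(String token)
    {
        if (token == null)
            return null;
        return token.endsWith("e") ? token.substring(0, token.length() - 1) : token;
    }

    // Old order: stem first, then drop stop words.
    static List<String> stemThenSkip(List<String> tokens)
    {
        return tokens.stream()
                     .map(t -> skipStopWords(stem(t)))
                     .filter(Objects::nonNull)
                     .collect(Collectors.toList());
    }

    // Proposed order: drop stop words first, then stem what is left.
    static List<String> skipThenStem(List<String> tokens)
    {
        return tokens.stream()
                     .map(t -> stem(skipStopWords(t)))
                     .filter(Objects::nonNull)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<String> tokens = Arrays.asList("danse", "dans", "la", "pluie");
        System.out.println(stemThenSkip(tokens)); // [plui]
        System.out.println(skipThenStem(tokens)); // [dans, plui]
    }
}
```

 In the old order "danse" is stemmed to "dans" and then discarded as a stop word; in the proposed order it survives as the stem "dans", while the genuine stop word "dans" is still removed.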

> [SASI] Move skip_stop_words filter BEFORE stemming
> --------------------------------------------------
>
>                 Key: CASSANDRA-12078
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>         Environment: Cassandra 3.7, Cassandra 3.8
>            Reporter: DOAN DuyHai
>            Assignee: DOAN DuyHai
>             Fix For: 3.8
>
>         Attachments: patch.txt
>
>
> Right now, if skip stop words and stemming are both enabled, SASI puts stemming in the
> filter pipeline BEFORE skip_stop_words:
> {code:java}
>     private FilterPipelineTask getFilterPipeline()
>     {
>         FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
>      ...
>         if (options.shouldStemTerms())
>             builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
>         if (options.shouldIgnoreStopTerms())
>             builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
>         return builder.build();
>     }
> {code}
> The problem is that stemming before removing stop words can yield wrong results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
> {code}
> Because of stemming, *danse* (*dance* in English) becomes *dans* (the final vowel is
> removed). Then skip stop words is applied. Unfortunately *dans* (*in* in English) is a stop
> word in French, so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE country='France'}}
> and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE the stemming filter.
> /cc [~xedin] [~jrwest] [~beobal]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
