cassandra-commits mailing list archives

From "DOAN DuyHai (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
Date Sun, 26 Jun 2016 07:54:33 GMT


DOAN DuyHai commented on CASSANDRA-12078:


 I was able to reproduce the unit test failure locally. The failing test is {{testTokenizationAdventuresOfHuckFinn}}: once stop words are skipped before stemming, the expected token count is *37739* instead of *40249*.

 Moving the stop-word filter before stemming also triggers a {{NullPointerException}}. In some cases the token is removed by the stop-word filter, so the input to the stemming filter is null. I've added an extra null check in {{DefaultStemmingFilter}}:

        public String process(String input) throws Exception
        {
            // A token dropped by the stop-word filter arrives here as null
            if (input == null || stemmer == null)
                return input;
            stemmer.setCurrent(input);
            return (stemmer.stem()) ? stemmer.getCurrent() : input;
        }
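
 To illustrate why the null check matters once the stop-word filter runs first, here is a minimal, self-contained sketch. The class {{NullSafePipelineSketch}}, its tiny stop-word set and its toy {{stem}} method are hypothetical stand-ins for illustration, not the SASI classes; the only assumption carried over from the patch is that a removed token is passed along as null.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NullSafePipelineSketch
{
    // Hypothetical stop-word list; the real filter loads a per-locale list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("dans", "la", "le"));

    // Stand-in for the stop-word filter: signals removal by returning null.
    public static String skipStopWords(String input)
    {
        return (input == null || STOP_WORDS.contains(input)) ? null : input;
    }

    // Toy stemmer (assumption): strips a single trailing 'e'.
    // The null guard mirrors the check added to the stemming filter.
    public static String stem(String input)
    {
        if (input == null)
            return input;
        return input.endsWith("e") ? input.substring(0, input.length() - 1) : input;
    }

    public static void main(String[] args)
    {
        // A stop word is removed first, so the stemmer receives null and passes it through.
        System.out.println(stem(skipStopWords("dans")));
        // A regular word survives the stop-word filter and is then stemmed.
        System.out.println(stem(skipStopWords("danse")));
    }
}
```

 Without the guard in {{stem}}, the first call would dereference null, which is the kind of failure described above.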

 I have also added a new unit test in {{StandardAnalyzerTest}} to cover the French issue described in the ticket:

    public void testSkipStopWordBeforeStemmingFrench() throws Exception
    {
        InputStream is = StandardAnalyzerTest.class.getClassLoader()
                ...

        StandardTokenizerOptions options = new StandardTokenizerOptions.OptionsBuilder().stemTerms(true)
                ...
        StandardAnalyzer tokenizer = new StandardAnalyzer();
        ...

        List<ByteBuffer> tokens = new ArrayList<>();
        List<String> words = new ArrayList<>();
        while (tokenizer.hasNext())
        {
            final ByteBuffer nextToken = tokenizer.next();
            ...
        }

        assertEquals(4, tokens.size());
        assertEquals("dans", words.get(0));
        assertEquals("plui", words.get(1));
        assertEquals("chanson", words.get(2));
        assertEquals("connu", words.get(3));
    }

> [SASI] Move skip_stop_words filter BEFORE stemming
> --------------------------------------------------
>                 Key: CASSANDRA-12078
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>         Environment: Cassandra 3.7, Cassandra 3.8
>            Reporter: DOAN DuyHai
>            Assignee: DOAN DuyHai
>             Fix For: 3.8
>         Attachments: patch.txt
> Right now, if skip stop words and stemming are enabled, SASI will put stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
>     private FilterPipelineTask getFilterPipeline()
>     {
>         FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
>      ...
>         if (options.shouldStemTerms())
>             builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
>         if (options.shouldIgnoreStopTerms())
>             builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
>         return builder.build();
>     }
> {code}
> The problem is that stemming before removing stop words can yield wrong results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
> {code}
> Because of stemming, *danse* (*dance* in English) becomes *dans* (the final vowel is removed). The skip stop words filter is then applied. Unfortunately, *dans* (*in* in English) is a stop word in French, so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE the stemming filter
> /cc [~xedin] [~jrwest] [~beobal]
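
The ordering problem described in the quoted issue can be reproduced with a small, self-contained sketch. Everything in it is an assumption made for illustration: {{FilterOrderSketch}}, the three-word stop list and the toy stemmer that strips a trailing 'e' are hypothetical and only mimic the behaviour reported for *danse*; they are not the SASI filter classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

public class FilterOrderSketch
{
    // Hypothetical French stop-word list, just large enough for the example.
    static final Set<String> FRENCH_STOP_WORDS = new HashSet<>(Arrays.asList("dans", "la", "une"));

    // Toy stemmer (assumption): drops one trailing 'e'; null passes through untouched.
    public static final UnaryOperator<String> STEM =
        t -> (t != null && t.endsWith("e")) ? t.substring(0, t.length() - 1) : t;

    // Stop-word filter: a removed term is represented by null.
    public static final UnaryOperator<String> SKIP_STOP_WORDS =
        t -> (t == null || FRENCH_STOP_WORDS.contains(t)) ? null : t;

    // Applies the filters to the term in the given order.
    public static String run(String term, List<UnaryOperator<String>> filters)
    {
        String current = term;
        for (UnaryOperator<String> filter : filters)
            current = filter.apply(current);
        return current;
    }

    public static void main(String[] args)
    {
        // Stemming first: "danse" -> "dans", which is then dropped as a stop word.
        System.out.println(run("danse", Arrays.asList(STEM, SKIP_STOP_WORDS)));
        // Stop words first: "danse" is kept and only then stemmed to "dans".
        System.out.println(run("danse", Arrays.asList(SKIP_STOP_WORDS, STEM)));
    }
}
```

With stemming first the search term disappears entirely, which matches the reported behaviour where the {{LIKE 'danse'}} predicate no longer restricts the query.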

This message was sent by Atlassian JIRA
