lucene-java-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Offset bug in WordDelimiterFilter?
Date Tue, 06 Dec 2016 11:27:15 GMT
Hello - I noticed something peculiar while running Lucene/Solr 6.3.0.

The plural vaccinatieprogramma's should have a startOffset of 0 and an endOffset of 21 when
passed through WordDelimiterFilter and/or stemmers, but it doesn't, which slightly messes up
highlighted terms.

    wdf = new WordDelimiterFilter(
        new CannedTokenStream(new Token("vaccinatieprogramma's", 0, 21)),
        DEFAULT_WORD_DELIM_TABLE, flags, null);
    assertTokenStreamContents(wdf,
        new String[] { "vaccinatieprogramma"},
        new int[] { 0 },
        new int[] { 21 });

   [junit4] Suite: org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestWordDelimiterFilter -Dtests.method=testOffsets -Dtests.seed=21AB10650E10CEB9 -Dtests.slow=true -Dtests.locale=bg-BG -Dtests.timezone=Etc/GMT+10 -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE 0.06s | TestWordDelimiterFilter.testOffsets <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: endOffset 0 expected:<21> but was:<19>

I would expect the same behaviour as with stemmers: the offsets always cover the full length
of the original term. So if a user queries for the singular term, the whole plural (original)
is highlighted.
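
To make the highlighting effect concrete, here is a minimal sketch in plain Java. It does not use any Lucene API; the class OffsetHighlightDemo and its highlight helper are hypothetical names made up for illustration. It only assumes that a highlighter wraps the slice of the original text between a token's startOffset and endOffset.

```java
public class OffsetHighlightDemo {

    // Hypothetical helper, not a Lucene API: wraps text[startOffset, endOffset)
    // in <em> tags, roughly as a highlighter would do with a token's offsets.
    static String highlight(String text, int startOffset, int endOffset) {
        return text.substring(0, startOffset)
            + "<em>" + text.substring(startOffset, endOffset) + "</em>"
            + text.substring(endOffset);
    }

    public static void main(String[] args) {
        String original = "vaccinatieprogramma's";

        // Offsets I expected (0, 21): the whole original term is highlighted.
        System.out.println(highlight(original, 0, 21));

        // Offsets reported by the failing test (0, 19): the trailing 's is left out.
        System.out.println(highlight(original, 0, 19));
    }
}
```

With offsets 0 and 19 the highlight stops before the apostrophe, which is the "slightly messed up" highlighting described above.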

Am I missing something? Is this a bug?

Thanks,
Markus
