lucene-java-user mailing list archives

From Christian Reuschling <christian.reuschl...@gmail.com>
Subject term offsets wrong depending on analyzer
Date Fri, 07 Nov 2008 17:52:34 GMT
Hi Guys,

I am currently seeing wrong term offset values for fields analyzed
with KeywordAnalyzer (and also for unanalyzed fields, for which I assume
the code may be the same).

The offset of a field seems to be incremented by the length of the entry
of the previously analyzed field.

I had a look at the code of KeywordAnalyzer and saw that it never sets
the offsets. I wrote my own Analyzer based on KeywordAnalyzer
and added the two lines

            reusableToken.setStartOffset(0);
            reusableToken.setEndOffset(upto);

inside KeywordTokenizer.next(..). It seems to work now (at least in the one
scenario with the KeywordAnalyzer).

I created a snippet that reproduces both situations; see the attachment.

This snippet also demonstrates another bug I found in the term offsets of
fields with multiple values. Depending on the Analyzer, certain characters are
recognized as delimiters during tokenization. If such delimiters occur at
the end of the first value of a field, the offsets of all following field
values are decremented by the number of these delimiters. It seems the offset
calculation forgets them.

This makes highlighting of hits in the second and later values impossible.
Currently I have a workaround where I count the Analyzer-specific delimiters at
the end of all values and adjust the offsets given by Lucene accordingly. It
works, but of course it isn't nice.
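
To illustrate the workaround, here is a minimal sketch in plain Java (it does
not use Lucene). The class name, the method names and the delimiter rule
(every character that is not a letter or digit) are only assumptions for this
example; a real version would have to mirror what the Analyzer in use treats
as a delimiter.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene: sketches the offset correction.
final class TrailingDelimiterOffsets {

    // Assumption: every character that is not a letter or digit is a delimiter.
    static boolean isDelimiter(char c) {
        return !Character.isLetterOrDigit(c);
    }

    // Counts the delimiter characters at the end of one field value.
    static int trailingDelimiters(String value) {
        int count = 0;
        for (int i = value.length() - 1; i >= 0 && isDelimiter(value.charAt(i)); i--) {
            count++;
        }
        return count;
    }

    // Sums the trailing delimiters of all values before valueIndex.
    static int correctionFor(List<String> values, int valueIndex) {
        int correction = 0;
        for (int i = 0; i < valueIndex; i++) {
            correction += trailingDelimiters(values.get(i));
        }
        return correction;
    }

    // Adds the correction to an offset reported for a token of the given value.
    static int correctedOffset(List<String> values, int valueIndex, int reportedOffset) {
        return reportedOffset + correctionFor(values, valueIndex);
    }
}
```

For the values "foo bar..." and "baz", the first value ends with three
delimiters, so every offset reported for a token of the second value would be
shifted up by 3 before it is handed to the highlighter.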

These situations appear with the current 2.4RC2.


I hope this will help a little, greetings


Christian Reuschling
