lucene-dev mailing list archives

From Mark Miller <>
Subject Re: [jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream
Date Tue, 11 Nov 2008 20:01:56 GMT
Michael McCandless (JIRA) wrote:
> Michael McCandless updated LUCENE-1448:
> ---------------------------------------
>     Attachment: LUCENE-1448.patch
> Attached new patch (changes described below):
> bq. You need that +1 or you will have the subsequent token starting on the tail of the
> So logically it's like we silently & forcefully insert a space between
> the Fieldable instances?
Is it? Let's straighten this out. Here is what I see from my test for:

field = "abcd the"
field = "crunch man"

abcd thecrunch man
a0 b1 c2 d3 (space)4 t5 h6 e7 c8 r9 u10 n11 c12 h13 (space)14 m15 a16 n17

Without the +1 I got:
abcd: 0-4
crunch: 7-13
man: 14-17

Something like that anyway.

With +1 I got:

which seems correct, right? With no space injected? Your test seemed to 
ensure the equivalent of crunch starting at 7... which is the 'e' 
position and not correct, right?
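For what it's worth, the cumulative-base arithmetic can be mimicked outside Lucene. The sketch below is plain Java, not Lucene's actual code; the whitespace splitter and field values are stand-ins, and the exact numbers from my test may differ depending on the analyzer. It just shows how an offset gap of 0 vs. 1 changes where the second field instance's tokens land:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not Lucene code): shift each field instance's token offsets
// by a cumulative base; the +1 offset gap acts like an invisible
// space inserted between field instances.
class OffsetGapSketch {

    // Returns "token: start-end" strings for whitespace-split fields.
    public static List<String> offsets(String[] fields, int offsetGap) {
        List<String> out = new ArrayList<>();
        int base = 0;
        for (String field : fields) {
            int i = 0;
            int finalOffset = 0;
            while (i < field.length()) {
                while (i < field.length() && field.charAt(i) == ' ') i++;
                int start = i;
                while (i < field.length() && field.charAt(i) != ' ') i++;
                if (i > start) {
                    out.add(field.substring(start, i) + ": "
                            + (base + start) + "-" + (base + i));
                    // naive: tracks only the last token's end, so
                    // trailing whitespace after it would be lost
                    finalOffset = i;
                }
            }
            base += finalOffset + offsetGap;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] fields = { "abcd the", "crunch man" };
        // Without the gap, crunch abuts "the":
        System.out.println(offsets(fields, 0));
        // [abcd: 0-4, the: 5-8, crunch: 8-14, man: 15-18]
        // With the +1 gap, crunch starts one position later:
        System.out.println(offsets(fields, 1));
        // [abcd: 0-4, the: 5-8, crunch: 9-15, man: 16-19]
    }
}
```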
> OK I added Analyzer.getOffsetGap(Fieldable), and defaulted it to
> return 1 for analyzed fields and 0 for unanalyzed fields.
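That default can be pictured with a tiny stand-in. This is a sketch, not the real Analyzer/Fieldable classes; only the 1-for-analyzed / 0-for-unanalyzed rule comes from the patch description above:

```java
// Stand-in for Fieldable (assumption: only the isTokenized()
// check matters for choosing the default gap).
interface FieldableSketch {
    boolean isTokenized();
}

class AnalyzerSketch {
    // Default described in the patch: +1 offset gap for analyzed
    // (tokenized) fields, 0 for unanalyzed fields.
    public int getOffsetGap(FieldableSketch field) {
        return field.isTokenized() ? 1 : 0;
    }
}
```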
> bq. What's wrong with public int getFinalOffset() { return scanner.yychar() + scanner.yylength(); }
> Does that handle spaces at the end of the text?  (Oh it seems like it
> does...I added a test case...hmm).
> bq. i didnt correctly put the SA piece in the jflex file
> I think this change (adding getFinalOffset to StandardTokenizer)
> doesn't need a change to jflex?  (It's only if you edit
My fault. I was assuming I missed it without really looking closely.
> Hmm another complexity is handling a field instance that produced no
> tokens.  Currently, we do not increment the cumulative offset by +1 in
> such cases.  But, for position increment gap we always add this gap in
> between fields if any field from the past has produced a token.  I
> added a couple test cases for this.
> Also, I fixed a bug in how CharTokenizer was computing its final
> offset.
> Still todo:
>   - add test cases to cover NOT_ANALYZED fields
>   - fix contrib tokenizers to implement getFinalOffset
>> add getFinalOffset() to TokenStream
>> -----------------------------------
>>                 Key: LUCENE-1448
>>                 URL:
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: Analysis
>>            Reporter: Michael McCandless
>>            Assignee: Michael McCandless
>>            Priority: Minor
>>             Fix For: 2.9
>>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
>> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
>> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and the next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
>> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fall back to the faulty logic we have today.
>> This has come up several times on the user's list.
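The fallback described in the issue could look roughly like this. A sketch with made-up names, not the eventual Lucene API; the -1 sentinel and the 1 + endOffset fallback are from the issue description above:

```java
// Sketch of the proposal: a stream reports its true final offset,
// or -1 for "I don't know", in which case the consumer keeps the
// current 1 + endOffset-of-last-token logic.
abstract class TokenStreamSketch {
    // end offset of the last token this stream produced
    protected int lastTokenEnd;

    // default: unknown, so the caller must fall back
    public int getFinalOffset() {
        return -1;
    }
}

class OffsetBase {
    // cumulative base to add to the next field instance's offsets
    public static int nextBase(int base, TokenStreamSketch stream) {
        int fin = stream.getFinalOffset();
        if (fin != -1) {
            // accurate: covers trailing whitespace / trailing stop words
            return base + fin + 1;
        }
        // today's faulty fallback
        return base + stream.lastTokenEnd + 1;
    }
}
```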
