lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream
Date Tue, 11 Nov 2008 19:43:44 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1448:
---------------------------------------

    Attachment: LUCENE-1448.patch

Attached new patch (changes described below):

bq. You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.


So logically it's like we silently & forcefully insert a space between
the Fieldable instances?

Maybe we should add Analyzer.getOffsetGap(String fieldName), which by
default would return 1, and we then add that into the offset for
subsequent field instances?

But then here's another challenge: for NOT_ANALYZED fields we don't
add this extra +1.  We just add the string length.  Hmm.

OK I added Analyzer.getOffsetGap(Fieldable), and defaulted it to
return 1 for analyzed fields and 0 for unanalyzed fields.
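
That default can be sketched as follows (a minimal stand-alone sketch, not the patch itself; the `Fieldable` stand-in here exposes only the one method the sketch needs):

```java
// Minimal stand-in for Lucene's Fieldable; only the method this sketch needs.
interface Fieldable {
    boolean isTokenized();   // true for analyzed fields, false for NOT_ANALYZED
}

class SketchAnalyzer {
    // Proposed default: analyzed field instances get a +1 offset gap (the
    // implicit "space" silently inserted between instances); NOT_ANALYZED
    // instances are concatenated with no extra gap, just the string length.
    public int getOffsetGap(Fieldable field) {
        return field.isTokenized() ? 1 : 0;
    }
}
```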

bq. What's wrong with public int getFinalOffset() { return scanner.yychar() + scanner.yylength(); }

Does that handle spaces at the end of the text?  (Oh it seems like it
does...I added a test case...hmm).
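
To make the trailing-space question concrete: with a whitespace-style tokenizer, the last token's end offset can fall short of the consumed text length, and that shortfall is exactly what getFinalOffset() must cover (toy class with illustrative names, not Lucene's actual tokenizers):

```java
// Toy whitespace tokenizer illustrating the problem: the end offset of the
// last token is smaller than the input length when the text has trailing
// whitespace, so offsets for the next field instance come out too small.
class ToyWhitespaceTokenizer {
    private final String text;
    private int lastEndOffset = 0;

    ToyWhitespaceTokenizer(String text) { this.text = text; }

    // Consume all tokens, remembering the end offset of the last one.
    int endOffsetOfLastToken() {
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;  // skip spaces
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;  // consume token
            if (i > start) lastEndOffset = i;
        }
        return lastEndOffset;
    }

    // What getFinalOffset() should report: the full consumed length,
    // including any trailing whitespace that never became a token.
    int getFinalOffset() { return text.length(); }
}
```

For "abc   " the last token ends at 3 but the final offset is 6, which is the 3-character discrepancy described in this issue.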

bq. i didnt correctly put the SA piece in the jflex file

I think this change (adding getFinalOffset to StandardTokenizer)
doesn't need a change to jflex?  (It's only if you edit
StandardTokenizerImpl.java).

Hmm, another complexity is handling a field instance that produced no
tokens.  Currently, we do not increment the cumulative offset by +1 in
such cases.  But, for the position increment gap, we always add the gap
between fields if any prior field instance has produced a token.  I
added a couple of test cases for this.
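
One way to read the bookkeeping described above (a sketch of my interpretation, not the actual IndexWriter code; all names here are made up): the offset base always advances by what each instance consumed, but the gap is only added when the instance actually produced tokens.

```java
// Sketch of cumulative offset bookkeeping across multiple instances of the
// same field name (illustrative only, not Lucene's internals).
class OffsetAccumulator {
    private int offsetBase = 0;

    // Returns the base added to this instance's token offsets, then advances
    // the base past the text this instance consumed.  The offset gap (e.g.
    // Analyzer.getOffsetGap(...): 1 analyzed, 0 NOT_ANALYZED) is only added
    // when the instance produced at least one token.
    int next(int consumedLength, boolean producedTokens, int offsetGap) {
        int base = offsetBase;
        offsetBase += consumedLength;
        if (producedTokens) {
            offsetBase += offsetGap;
        }
        return base;
    }
}
```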

Also, I fixed a bug in how CharTokenizer was computing its final
offset.

Still todo:
  - add test cases to cover NOT_ANALYZED fields
  - fix contrib tokenizers to implement getFinalOffset


> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you
then index those fields with TermVectors storing offsets, it's very likely the offsets for
all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each
field instance, where that base is 1 + the endOffset of the last token it saw when analyzing
that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being
used, and the text being analyzed ended in 3 whitespace characters, then that information
is lost and the next field's offsets are all 3 too small.  Similarly, if a StopFilter
appears in the chain, and the last N tokens were stop words, then the base will be 1 + the
endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by
default it returns -1, which means "I don't know so you figure it out", meaning we fallback
to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

