lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-3088) inconsistency of tokenstream.end() with OffsetLimitTokenFilter and LimitTokenCountFilter
Date Thu, 12 May 2011 13:45:47 GMT
inconsistency of tokenstream.end() with OffsetLimitTokenFilter and LimitTokenCountFilter
----------------------------------------------------------------------------------------

                 Key: LUCENE-3088
                 URL: https://issues.apache.org/jira/browse/LUCENE-3088
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Robert Muir


In LUCENE-3064, we added some state and checks to MockTokenizer to validate that consumers
are properly using the TokenStream workflow (described here: http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/analysis/TokenStream.html).

One inconsistency concerns the following steps of that workflow:
4. The consumer calls incrementToken() until it returns false, consuming the attributes after
each call.
5. The consumer calls end() so that any end-of-stream operations can be performed.
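
As a rough illustration of steps 4 and 5, here is a minimal sketch in plain Java. The
SimpleStream interface, CheckingTokenizer and WorkflowDemo are hypothetical stand-ins
written for this example; they are not Lucene classes, and the "early end()" flag only
loosely imitates the kind of check MockTokenizer performs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a token stream; NOT Lucene's TokenStream.
interface SimpleStream {
    boolean incrementToken(); // advance to the next token; false once exhausted
    String term();            // text of the current token
    void end();               // end-of-stream operations
}

// Whitespace splitter that records whether end() arrived before exhaustion.
final class CheckingTokenizer implements SimpleStream {
    private final String[] words;
    private int pos = -1;
    private boolean exhausted = false; // set once incrementToken() has returned false
    boolean endCalledEarly = false;

    CheckingTokenizer(String text) {
        this.words = text.trim().split("\\s+");
    }

    @Override public boolean incrementToken() {
        if (pos + 1 < words.length) {
            pos++;
            return true;
        }
        exhausted = true;
        return false;
    }

    @Override public String term() {
        return words[pos];
    }

    @Override public void end() {
        if (!exhausted) {
            endCalledEarly = true;
        }
    }
}

final class WorkflowDemo {
    // Step 4: loop until incrementToken() returns false. Step 5: then call end().
    static List<String> consume(SimpleStream stream) {
        List<String> terms = new ArrayList<>();
        while (stream.incrementToken()) {
            terms.add(stream.term());
        }
        stream.end();
        return terms;
    }

    public static void main(String[] args) {
        CheckingTokenizer tokenizer = new CheckingTokenizer("a b c");
        System.out.println(consume(tokenizer) + " endCalledEarly=" + tokenizer.endCalledEarly);
    }
}
```

A consumer that follows this order never trips the flag, because end() is only reached
after incrementToken() has returned false.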

In the case of these limiting filters, end() is called on the Tokenizer *before* the Tokenizer's
own incrementToken() has returned false. This is a little strange for a few reasons. One is that
the tokenizer might not even be "ready" for end(), e.g. it might be coded so that end() only works
correctly once it's entirely consumed. The other problem, of course, is that the finalOffset, the
general use of end(), will most often be wrong in this case, so multi-valued field highlighting
will not work.
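
The effect can be reproduced with a small model, sketched below under stated assumptions.
OffsetTokenizer and CountLimitFilter are hypothetical simplifications written for this
example, not the real Tokenizer or LimitTokenCountFilter; in particular, the model assumes
end() reports however far the input has been scanned as the final offset.

```java
// Hypothetical, simplified model; these are NOT Lucene's Tokenizer or LimitTokenCountFilter.
final class OffsetTokenizer {
    private final String text;
    private int cursor = 0;            // how far the input has been scanned
    private boolean exhausted = false; // set once incrementToken() has returned false
    int finalOffset = -1;              // set by end()
    boolean endCalledEarly = false;    // true if end() arrived before exhaustion

    OffsetTokenizer(String text) {
        this.text = text;
    }

    boolean incrementToken() {
        // Skip whitespace, then scan one whitespace-delimited token.
        while (cursor < text.length() && Character.isWhitespace(text.charAt(cursor))) cursor++;
        if (cursor >= text.length()) {
            exhausted = true;
            return false;
        }
        while (cursor < text.length() && !Character.isWhitespace(text.charAt(cursor))) cursor++;
        return true;
    }

    void end() {
        endCalledEarly = !exhausted;
        finalOffset = cursor; // equals text.length() only if the whole input was scanned
    }
}

// Stops after maxTokens and forwards end() unconditionally.
final class CountLimitFilter {
    private final OffsetTokenizer input;
    private final int maxTokens;
    private int count = 0;

    CountLimitFilter(OffsetTokenizer input, int maxTokens) {
        this.input = input;
        this.maxTokens = maxTokens;
    }

    boolean incrementToken() {
        if (count < maxTokens && input.incrementToken()) {
            count++;
            return true;
        }
        return false; // the wrapped tokenizer may still have tokens left
    }

    void end() {
        input.end();
    }
}

final class EarlyEndDemo {
    public static void main(String[] args) {
        String text = "one two three four";
        OffsetTokenizer tokenizer = new OffsetTokenizer(text);
        CountLimitFilter filter = new CountLimitFilter(tokenizer, 2);
        while (filter.incrementToken()) {
            // consume attributes here
        }
        filter.end();
        System.out.println("finalOffset=" + tokenizer.finalOffset
                + " textLength=" + text.length()
                + " endCalledEarly=" + tokenizer.endCalledEarly);
    }
}
```

With a limit of 2 over this 18-character input, the model reports a final offset of 7 and
flags end() as early, even though the consumer itself followed the documented workflow.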

We should probably figure out a way to address the inconsistency. Some ideas are:
# fixing the javadocs, perhaps documenting that end() could be called at any time, and accepting
the fact that the finalOffset will be wrong.
# the limiting filters could consume the rest of the tokens in a while (incrementToken())
loop to ensure totally proper behavior.
# the limiting filters could do something tricky like override end() so that it's not invoked
on the Tokenizer in a surprising state. This is still evil but perhaps less evil than calling
it "out of order".
# ...
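
To make the second and third ideas concrete, here is a hedged sketch in the same spirit.
ScanningTokenizer, DrainingLimitFilter and QuietEndLimitFilter are hypothetical names
invented for this example; this is one possible shape for each idea, not a proposed patch.

```java
// Hypothetical, simplified model; NOT Lucene's Tokenizer.
final class ScanningTokenizer {
    private final String text;
    private int cursor = 0;            // how far the input has been scanned
    private boolean exhausted = false; // set once incrementToken() has returned false
    int finalOffset = -1;              // stays -1 until end() is called
    boolean endCalledEarly = false;

    ScanningTokenizer(String text) {
        this.text = text;
    }

    boolean incrementToken() {
        while (cursor < text.length() && Character.isWhitespace(text.charAt(cursor))) cursor++;
        if (cursor >= text.length()) {
            exhausted = true;
            return false;
        }
        while (cursor < text.length() && !Character.isWhitespace(text.charAt(cursor))) cursor++;
        return true;
    }

    void end() {
        endCalledEarly = !exhausted;
        finalOffset = cursor;
    }
}

// Idea 2: once the limit is hit, drain the remaining tokens before returning false.
final class DrainingLimitFilter {
    private final ScanningTokenizer input;
    private final int maxTokens;
    private int count = 0;
    private boolean done = false;

    DrainingLimitFilter(ScanningTokenizer input, int maxTokens) {
        this.input = input;
        this.maxTokens = maxTokens;
    }

    boolean incrementToken() {
        if (done) return false;
        if (count < maxTokens) {
            if (input.incrementToken()) {
                count++;
                return true;
            }
        } else {
            while (input.incrementToken()) {
                // discard tokens past the limit
            }
        }
        done = true; // the input has returned false exactly once
        return false;
    }

    void end() {
        input.end();
    }
}

// Idea 3: only forward end() if the input was actually exhausted.
final class QuietEndLimitFilter {
    private final ScanningTokenizer input;
    private final int maxTokens;
    private int count = 0;
    private boolean inputExhausted = false;

    QuietEndLimitFilter(ScanningTokenizer input, int maxTokens) {
        this.input = input;
        this.maxTokens = maxTokens;
    }

    boolean incrementToken() {
        if (inputExhausted || count >= maxTokens) return false;
        if (input.incrementToken()) {
            count++;
            return true;
        }
        inputExhausted = true;
        return false;
    }

    void end() {
        if (inputExhausted) input.end(); // otherwise the tokenizer never sees end()
    }
}

final class LimitIdeasDemo {
    public static void main(String[] args) {
        ScanningTokenizer drained = new ScanningTokenizer("one two three four");
        DrainingLimitFilter draining = new DrainingLimitFilter(drained, 2);
        while (draining.incrementToken()) { }
        draining.end();
        System.out.println("draining: finalOffset=" + drained.finalOffset);

        ScanningTokenizer skipped = new ScanningTokenizer("one two three four");
        QuietEndLimitFilter quiet = new QuietEndLimitFilter(skipped, 2);
        while (quiet.incrementToken()) { }
        quiet.end();
        System.out.println("quiet: finalOffset=" + skipped.finalOffset);
    }
}
```

The trade-off shows up directly in the model: draining pays for tokenizing the whole input
but yields a correct final offset, while the quiet variant avoids the out-of-order call at
the cost of never reporting a final offset when the limit is hit.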


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
