lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Lucene 4.0 tokenstream logic
Date Thu, 11 Jul 2013 07:57:52 GMT
Please post the code of the TokenStream(s) behind this analyzer. The bug is there (and it is a bug, if it is not working correctly). Lucene's built-in analyzers don't have this problem, because the TokenStreams behind them are correctly implemented and tested. In most cases such problems appear when the underlying TokenStream does not implement token caching correctly (with captureState or cloneAttributes) or fails to implement reset() correctly.
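A plain-Java sketch of the caching pitfall described above. This is not Lucene code; MiniStream and termBuffer are illustrative stand-ins for a TokenStream and its CharTermAttribute, which Lucene likewise reuses as a single mutable object across incrementToken() calls. Caching that object by reference instead of copying it (which is what captureState/cloneAttributes exist for) leaves every cached "token" pointing at the last value:

```java
import java.util.ArrayList;
import java.util.List;

public class CacheDemo {
    // Illustrative analog of a TokenStream: ONE mutable buffer is
    // overwritten on every incrementToken() call.
    static class MiniStream {
        private final String[] tokens;
        private int pos = 0;
        final StringBuilder termBuffer = new StringBuilder(); // reused per token

        MiniStream(String... tokens) { this.tokens = tokens; }

        boolean incrementToken() {
            if (pos >= tokens.length) return false;
            termBuffer.setLength(0);          // the buffer is recycled...
            termBuffer.append(tokens[pos++]); // ...then refilled in place
            return true;
        }
    }

    public static void main(String[] args) {
        MiniStream s = new MiniStream("demo", "TokenStream", "API");
        List<StringBuilder> byReference = new ArrayList<>();
        List<String> byCopy = new ArrayList<>();
        while (s.incrementToken()) {
            byReference.add(s.termBuffer);       // WRONG: all entries alias one buffer
            byCopy.add(s.termBuffer.toString()); // right: a snapshot, like captureState
        }
        System.out.println(byReference); // [API, API, API] -- every entry shows the LAST token
        System.out.println(byCopy);      // [demo, TokenStream, API]
    }
}
```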

The problem you have is this: in Lucene 4.x a TokenStream is *required* to support reuse, so reset() must be implemented and must restore a consistent state.
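To illustrate the reuse contract, here is a minimal plain-Java sketch (ReusableStream, setInput, and term are illustrative names, not Lucene's real API). The same stream instance is handed out again for every new input, so a forgotten or broken reset() leaves a stale end-of-stream position behind, and every pass after the first emits nothing, which is exactly the "only the first token comes out" symptom:

```java
public class ResetDemo {
    // Illustrative analog of a reusable TokenStream.
    static class ReusableStream {
        private String[] tokens = new String[0];
        private int pos;

        void setInput(String... tokens) { this.tokens = tokens; } // new text, old state
        void reset() { pos = 0; }                                 // must restore a consistent state

        boolean incrementToken() {
            if (pos >= tokens.length) return false;
            pos++;
            return true;
        }

        String term() { return tokens[pos - 1]; }
    }

    public static void main(String[] args) {
        ReusableStream stream = new ReusableStream(); // ONE instance, reused per input
        for (String text : new String[]{"demo", "TokenStream", "API"}) {
            stream.setInput(text);
            stream.reset(); // omit this and only the first pass emits a token
            while (stream.incrementToken()) {
                System.out.println("Token : " + stream.term());
            }
        }
    }
}
```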

If you want to test your custom TokenStreams and analyzers, you should use BaseTokenStreamTestCase from the Lucene test-framework. It will show you all misuse of APIs inside TokenStream implementations (such as caching tokens incorrectly without captureState/restoreState, and so on).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: zzT [mailto:zis.tax@gmail.com]
> Sent: Thursday, July 11, 2013 9:31 AM
> To: java-user@lucene.apache.org
> Subject: Lucene 4.0 tokenstream logic
> 
> Hi all,
> 
> I'm migrating from Lucene 3.6.1 to 4.3.1 and there seems to be a major
> change in how analyzers work....
> Given the code example below (which is almost copied from
> http://lucene.apache.org/core/4_3_1/core/index.html)
> 
> @Test
> public void testAnalysis() throws IOException {
>     final String[] texts = {"demo", "TokenStream", "API"};
>     CustomAnalyzer analyzer = new CustomAnalyzer(IndexLocale.ENGLISH, false);
> 
>     for (String text : texts) {
>         TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
>         CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
> 
>         try {
>             stream.reset();
>             while (stream.incrementToken()) {
>                 System.out.println("Token : " + termAtt.toString());
>             }
>             stream.end();
>         } finally {
>             stream.close();
>         }
>     }
> }
> 
> The output is the following
> in 3.6.1 :
> Token : demo
> Token : Tokenstream
> Token : API
> 
> while in
> 4.3.1 :
> Token : demo
> 
> This is happening because of the ReuseStrategy that is now embedded
> inside Analyzer.TokenStream which caches the 1st token ("demo") and
> reuses this one afterwards.
> 
> CustomAnalyzer is a custom analyzer :) and its implementation is irrelevant to
> the question (apart from the fact that in 3.6.1 it overrides
> tokenStream() while in 4.3.1 it overrides createComponents() ). I'm pretty
> sure the same is happening with Lucene's analyzers too.
> 
> The question is : Do I need to change something in my logic to make it work
> as in 3.6.1? The only way to get the same output is by initializing
> CustomAnalyzer before calling tokenstream().
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Lucene-
> 4-0-tokenstream-logic-tp4077203.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

