lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Using an AnalyzerWrapper with ASCIIFoldingFilter
Date Fri, 15 Mar 2013 18:25:39 GMT
Hi,

The API did not really change. The bug is in your test:
If you read the javadocs of the TokenStream interface carefully, you will notice that
your consumer does not follow the required workflow: http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/analysis/TokenStream.html

In short, a TokenStream must be reset() before incrementToken() is called. This has not
changed and has always been the case. In earlier Lucene versions, lots of consumers got
this wrong, so we made the basic Tokenizers "fail" in some way when the workflow is
violated. The exception is not really helpful here, but for performance reasons this was
the only way to go.

Please take care that the workflow described in the Javadocs is always followed from top
to bottom (including end() and close()); otherwise the behavior of TokenStreams is not
guaranteed to be correct.
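
For illustration, here is a minimal sketch of that consumer workflow, reusing the
analyzer, field name, and input from the test quoted below (assumptions: the Lucene 4.2
API; imports and the surrounding class are omitted, as elsewhere in this thread):

TokenStream stream = analyzer.tokenStream("test", new StringReader("a é î ø ü"));
try {
    stream.reset();                  // mandatory before the first incrementToken()
    while (stream.incrementToken()) {
        // read the term text of the current token
        CharTermAttribute term = stream.getAttribute(CharTermAttribute.class);
        System.out.println(term.toString());
    }
    stream.end();                    // records end-of-stream state such as the final offset
} finally {
    stream.close();                  // releases resources held by the stream
}

The order matters: reset() comes first, end() follows the last incrementToken(), and
close() must run in all cases.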

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Steven Schlansker [mailto:steven@likeness.com]
> Sent: Friday, March 15, 2013 7:18 PM
> To: java-user@lucene.apache.org
> Subject: Using an AnalyzerWrapper with ASCIIFoldingFilter
> 
> Hi everyone,
> 
> I am trying to port forward to 4.2 some Lucene 3.2-era code that uses the
> ASCIIFoldingFilter.
> The token stream handling has changed significantly since then, and I cannot
> figure out what I am doing wrong.
> 
> It seems that I should extend AnalyzerWrapper so that I can intercept the
> TokenStream and filter it with the ASCIIFoldingFilter.
> 
> I have written the following code:
> 
> public final class TokenFilterAnalyzerWrapper extends AnalyzerWrapper {
>     private final Analyzer baseAnalyzer;
>     private final TokenFilterFactory tokenFilterFactory;
> 
>     public TokenFilterAnalyzerWrapper(Analyzer baseAnalyzer, TokenFilterFactory tokenFilterFactory) {
>         this.baseAnalyzer = baseAnalyzer;
>         this.tokenFilterFactory = tokenFilterFactory;
>     }
> 
>     @Override
>     public void close() {
>         baseAnalyzer.close();
>         super.close();
>     }
> 
>     @Override
>     protected Analyzer getWrappedAnalyzer(String fieldName)
>     {
>         return baseAnalyzer;
>     }
> 
>     @Override
>     protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components)
>     {
>         return new TokenStreamComponents(components.getTokenizer(),
>                 tokenFilterFactory.create(components.getTokenStream()));
>     }
> }
> 
> and the following test case:
> 
> public class TokenFilterAnalyzerWrapperTest {
>     @Test
>     public void testFilter() throws Exception
>     {
>         char[] expected = {'a', 'e', 'i', 'o', 'u'};
>         try (Analyzer analyzer = new TokenFilterAnalyzerWrapper(
>                 new StandardAnalyzer(Version.LUCENE_42), new ASCIIFoldingFilterFactory())) {
>             TokenStream stream = analyzer.tokenStream("test", new StringReader("a é î ø ü"));
> 
>             for (int i = 0; i < 5; i++) {
>                 assertTrue(stream.incrementToken());
>                 assertEquals(Character.toString(expected[i]),
>                         stream.getAttribute(CharTermAttribute.class).toString());
>             }
> 
>             assertFalse(stream.incrementToken());
>         }
>     }
> }
> 
> but all I can produce is this NullPointerException:
> java.lang.NullPointerException
> 	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
> 	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
> 	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
> 	at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
> 	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
> 	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
> 	at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:71)
> 	at xyz.search.lucene.TokenFilterAnalyzerWrapperTest.testFilter(TokenFilterAnalyzerWrapperTest.java:27)
> 
> StandardTokenizerImpl.java:923 is
> 
>     /* finally: fill the buffer with new input */
>     int numRead = zzReader.read(zzBuffer, zzEndRead, zzBuffer.length - zzEndRead);
> 
> The "reader" is clearly the unexpectedly null value, however I cannot figure
> out how to set it correctly.
> 
> Through experimentation, it seems that I can work around some of the problems
> by calling reset() and setReader() at various points. However, I always end up
> with some other exception buried deep within Lucene, so I believe I am still
> missing a piece of the puzzle.
> 
> Any help greatly appreciated!
> 
> Thanks,
> Steven

