lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Confusion with Analyzer.tokenStream() re-use in 4.1
Date Wed, 27 Feb 2013 19:02:41 GMT
The problem here is that the TokenStream is instantiated in the same thread from two different
code paths and consumed later. When you add fields, the indexer fetches the reused TokenStreams
one after another and consumes each one directly after getting it; it never interleaves them.
In your case, the second field is instantiated with a TokenStream that is already initialized.
Unfortunately, as soon as you ask the Analyzer for another TokenStream, the one it handed out
earlier becomes invalid (here, the second field's stream).

Don't use new Field(name, TokenStream) with TokenStreams obtained from Analyzers, because they
are only "valid" for a very short time. If you really need to do this, use a second Analyzer
instance. If you instead add fields with a String value, the TokenStream is created on the fly
and consumed by the DocumentsWriter directly after it is obtained.
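To illustrate the reuse contract, here is a minimal sketch in plain Java. The ReusingAnalyzer and ReusedStream classes are invented stand-ins for Lucene's real Analyzer/TokenStream (the real classes have a richer API); the point they model is that the Analyzer hands out one shared per-thread stream, so asking for a second stream invalidates the first. Consume each stream immediately after getting it, exactly as the indexer does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical stand-in for a reused TokenStream: one shared buffer
// that gets reset to new content each time the analyzer is asked again.
class ReusedStream {
    private Iterator<String> tokens = Collections.emptyIterator();
    private String current;

    void reset(String text) {
        tokens = Arrays.asList(text.split("\\s+")).iterator();
    }

    boolean incrementToken() {
        if (!tokens.hasNext()) return false;
        current = tokens.next();
        return true;
    }

    String term() {
        return current;
    }
}

// Hypothetical stand-in for an Analyzer that reuses its components.
class ReusingAnalyzer {
    private final ReusedStream reused = new ReusedStream(); // single shared instance

    ReusedStream tokenStream(String field, String text) {
        reused.reset(text); // invalidates any stream handed out earlier
        return reused;
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        ReusingAnalyzer a = new ReusingAnalyzer();

        ReusedStream first = a.tokenStream("name1", "aaa bbb ccc");
        // Correct usage: consume directly after getting the stream.
        StringBuilder sb = new StringBuilder();
        while (first.incrementToken()) sb.append(first.term()).append('|');
        System.out.println(sb); // aaa|bbb|ccc|

        ReusedStream second = a.tokenStream("name2", "xxx zzz");
        // 'first' and 'second' are the SAME object; requesting the second
        // stream reset the first, so a reference kept from before is no
        // longer valid -- the behaviour observed in the question below.
        System.out.println(first == second); // true
    }
}
```

The same rule holds for the real classes: a TokenStream returned by Analyzer.tokenStream() must be fully consumed before the analyzer is asked for another one.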

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Konstantyn Smirnov [mailto:injecteer@yahoo.com]
> Sent: Wednesday, February 27, 2013 6:25 PM
> To: java-user@lucene.apache.org
> Subject: Confusion with Analyzer.tokenStream() re-use in 4.1
> 
> Dear all,
> 
> I'm using the following test-code:
> 
> Document doc = new Document()
> Analyzer a = new SimpleAnalyzer( Version.LUCENE_41 )
> 
> TokenStream inputTS = a.tokenStream( 'name1', new StringReader( 'aaa bbb ccc' ) )
> Field f = new TextField( 'name1', inputTS )
> doc.add f
> 
> TokenStream ts = doc.getField( 'name1' ).tokenStreamValue()
> ts.reset()
> 
> String sb = ''
> while( ts.incrementToken() ) sb += ts.getAttribute( CharTermAttribute ) + '|'
> assert 'aaa|bbb|ccc|' == sb
> 
> inputTS = a.tokenStream( 'name2', new StringReader( 'xxx zzz' ) )
> f = new TextField( 'name2', inputTS )
> doc.add f
> 
> ts = doc.getField( 'name2' ).tokenStreamValue()
> ts.reset()
> 
> sb = ''
> while( ts.incrementToken() ) sb += ts.getAttribute( CharTermAttribute ) + '|'
> assert 'xxx|zzz|' == sb // << FAILS! -> sb == '' and ts.incrementToken() == false
> 
> The first added field lets me read its tokenStreamValue() tokens; all subsequent
> calls return nothing unless I re-instantiate the analyzer.
> 
> Another strange thing is that just before adding a new field to the
> document, the tokenStream is already filled.
> 
> What am I doing wrong?
> 
> TIA
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Confusion-with-Analyzer-
> tokenStream-re-use-in-4-1-tp4043427.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



