From: "Uwe Schindler" <uwe@thetaphi.de>
To: java-user@lucene.apache.org
Subject: RE: Confusion with Analyzer.tokenStream() re-use in 4.1
Date: Wed, 27 Feb 2013 20:02:41 +0100
Message-ID: <000801ce151d$04f011c0$0ed03540$@thetaphi.de>
In-Reply-To: <1361985921447-4043427.post@n3.nabble.com>
The problem here is that the TokenStream is instantiated in the same thread from two different code paths and consumed later. When you add fields, the indexer fetches the reused TokenStreams one after another and consumes each one directly after getting it; it does not interleave them. In your case, the second field is instantiated using a TokenStream while the first one is still pending. Unfortunately, once you ask the analyzer for another TokenStream, the already opened one (the first field's) becomes invalid.

Don't use new Field(name, TokenStream) with TokenStreams obtained from Analyzers, because they are only "valid" for a very short time. If you need to do this, use a second Analyzer instance. If you instead add fields with a String value, the TokenStream is created on the fly and consumed by the DocumentsWriter directly after it is fetched.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Konstantyn Smirnov [mailto:injecteer@yahoo.com]
> Sent: Wednesday, February 27, 2013 6:25 PM
> To: java-user@lucene.apache.org
> Subject: Confusion with Analyzer.tokenStream() re-use in 4.1
>
> Dear all,
>
> I'm using the following test-code:
>
> Document doc = new Document()
> Analyzer a = new SimpleAnalyzer( Version.LUCENE_41 )
>
> TokenStream inputTS = a.tokenStream( 'name1', new StringReader( 'aaa bbb ccc' ) )
> Field f = new TextField( 'name1', inputTS )
> doc.add f
>
> TokenStream ts = doc.getField( 'name1' ).tokenStreamValue()
> ts.reset()
>
> String sb = ''
> while( ts.incrementToken() ) sb += ts.getAttribute( CharTermAttribute ) + '|'
> assert 'aaa|bbb|ccc|' == sb
>
> inputTS = a.tokenStream( 'name2', new StringReader( 'xxx zzz' ) )
> f = new TextField( 'name2', inputTS )
> doc.add f
>
> ts = doc.getField( 'name2' ).tokenStreamValue()
> ts.reset()
>
> sb = ''
> while( ts.incrementToken() ) sb += ts.getAttribute( CharTermAttribute ) + '|'
> assert 'xxx|zzz|' == sb // << FAILS! -> sb == '' and ts.incrementToken() == false
>
> The first added field lets me read its tokenStreamValue() tokens; all subsequent
> calls return nothing, unless I re-instantiate the analyzer.
>
> Another strange thing is that just before adding a new field to the
> document, the TokenStream is still filled.
>
> What am I doing wrong?
>
> TIA
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Confusion-with-Analyzer-
> tokenStream-re-use-in-4-1-tp4043427.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
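[A minimal sketch of the safe pattern Uwe describes, in plain Java. It is untested and assumes the Lucene 4.1 jars are on the classpath; the class name and the drain() helper are made up for illustration. The point is that each TokenStream is fully consumed, ended, and closed before the Analyzer is asked for the next one, so the analyzer's internal reuse of its components never invalidates a stream that is still being read.]

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamReuseDemo {

    /** Drain one TokenStream completely before the Analyzer hands out the next. */
    static String drain(Analyzer a, String field, String text) throws IOException {
        TokenStream ts = a.tokenStream(field, new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        StringBuilder sb = new StringBuilder();
        ts.reset();                       // mandatory before incrementToken()
        while (ts.incrementToken()) {
            sb.append(term).append('|');
        }
        ts.end();                         // finish the stream completely...
        ts.close();                       // ...before the Analyzer may reuse it
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Analyzer a = new SimpleAnalyzer(Version.LUCENE_41);
        // Each stream is drained before the next tokenStream() call, so the
        // reuse is safe -- unlike stashing both streams in Fields first.
        System.out.println(drain(a, "name1", "aaa bbb ccc")); // aaa|bbb|ccc|
        System.out.println(drain(a, "name2", "xxx zzz"));     // xxx|zzz|
    }
}
```

[For indexing itself, the simpler fix per the advice above is to pass the String to the field (e.g. new TextField(name, "aaa bbb ccc", Field.Store.NO)) and let the indexer pull the TokenStream on the fly.]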