lucene-java-user mailing list archives

From "Teruhiko Kurosaka" <K...@basistech.com>
Subject RE: How do TeeTokenizer and SinkTokenizer work?
Date Mon, 25 Aug 2008 23:29:37 GMT
Thank you, Grant and (Koji) Sekiguchi-san.


> > but I don't understand how the input from reader1 and reader2 are
> > mixed together.
> > Will sink1 first return the reader1 text, and then reader2?
>
> It depends on the order the fields are added.  If source1 is
> used first, then reader1 will be first.

This puzzles me.  Are SinkTokenizer and TeeTokenFilter really useful
if their behavior depends on the order in which they are read?
I've read the source code of both classes, but that
didn't answer my question.

This is an excerpt from Sekiguchi-san's code sample:

	Analyzer analyzer = new Analyzer() {
		public TokenStream tokenStream(String field, Reader in) {
			// Each TeeTokenFilter passes tokens through and copies them into its sink.
			return new TeeTokenFilter(
				new TeeTokenFilter( new SenTokenizer( in, SEN_CONF ), sinkPerson ),
				sinkOrg );
		}
	};

	// The extractors read from the sinks, not from the original Reader.
	TokenFilter exPerson = new EntityExtractor( sinkPerson, T_PERSON );
	TokenFilter exOrg = new EntityExtractor( sinkOrg, T_ORG );
	IndexWriter writer = new IndexWriter( INDEX, analyzer, true );
	Document doc = new Document();
	doc.add( new Field( F_BODY, CONTENT, Store.YES, Index.TOKENIZED ) );
	doc.add( new Field( F_PERSON, exPerson ) );
	doc.add( new Field( F_ORG, exOrg ) );
	writer.addDocument( doc );

It seems that the code works as expected only if the token stream from
the analyzer on CONTENT is read completely first, then the token stream
from sinkPerson is read completely, followed by that from sinkOrg.
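
To make the ordering concern concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: Tee and Sink are hypothetical stand-ins that only model the idea of "pass tokens through and copy them into a buffer", and the token values are made up. Under that assumption, a sink that is read before its tee has been consumed sees nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified model of the tee/sink idea. These are NOT Lucene's classes.
class TeeSinkSketch {

    /** Buffers tokens copied into it by a Tee and replays them on request. */
    static class Sink {
        private final List<String> buffered = new ArrayList<>();

        void add(String token) {
            buffered.add(token);
        }

        /** Returns a copy of whatever has been buffered so far. */
        List<String> readAll() {
            return new ArrayList<>(buffered);
        }
    }

    /** Passes tokens through from a source and copies each one into a sink. */
    static class Tee implements Iterator<String> {
        private final Iterator<String> source;
        private final Sink sink;

        Tee(Iterator<String> source, Sink sink) {
            this.source = source;
            this.sink = sink;
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public String next() {
            String token = source.next();
            sink.add(token); // the copy happens only when the tee is consumed
            return token;
        }
    }

    /** Consumes an iterator completely and returns the tokens it produced. */
    static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("alice", "joined", "acme");
        Sink sink = new Sink();
        Tee tee = new Tee(tokens.iterator(), sink);

        // Reading the sink first: the tee has not produced anything yet.
        System.out.println("before: " + sink.readAll()); // prints "before: []"

        // Reading the tee to the end fills the sink.
        drain(tee);
        System.out.println("after: " + sink.readAll()); // prints "after: [alice, joined, acme]"
    }
}
```

In this model the result depends entirely on whether the tee is drained before the sink is read, which is exactly the dependence on read order in question.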

Does Lucene's core guarantee that a field's token stream is read completely
before the next field's token stream is read, in the order the Fields are add()ed?

- Kuro


