lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: How do TeeTokenizer and SinkTokenizer work?
Date Tue, 26 Aug 2008 13:14:36 GMT

On Aug 25, 2008, at 7:29 PM, Teruhiko Kurosaka wrote:

> Thank you, Grant and (Koji) Sekiguchi-san.
>
>>> but I don't understand how the input from reader1 and reader2 is mixed
>>> together.  Will sink1 first return the reader1 text, and then reader2?
>>
>> It depends on the order the fields are added.  If source1 is
>> used first, then reader1 will be first.
>
> This puzzles me.  Is this really useful if how SinkTokenizer
> and TeeTokenizer behave depends on how they are read?

Fields in a Document are kept in a List, so the Field ordering is
always the same as the order in which they were added.

>
> I've read the source code of these Tokenizers but that
> didn't solve my question.
>
> This is an excerpt from Sekiguchi-san's code sample:
>
> 	Analyzer analyzer = new Analyzer() {
> 		public TokenStream tokenStream(String field, Reader in) {
> 			return new TeeTokenFilter(
> 				new TeeTokenFilter( new SenTokenizer( in, SEN_CONF ), sinkPerson ),
> 				sinkOrg );
> 		}
> 	};
>
> 	TokenFilter exPerson = new EntityExtractor( sinkPerson, T_PERSON );
> 	TokenFilter exOrg = new EntityExtractor( sinkOrg, T_ORG );
> 	IndexWriter writer = new IndexWriter( INDEX, analyzer, true );
> 	Document doc = new Document();
> 	doc.add( new Field( F_BODY, CONTENT, Store.YES, Index.TOKENIZED ) );
> 	doc.add( new Field( F_PERSON, exPerson ) );
> 	doc.add( new Field( F_ORG, exOrg ) );
> 	writer.addDocument( doc );
>
> It seems that the code works as expected only if the token stream from
> the analyzer on CONTENT is read completely, then the token stream from
> sinkPerson is read completely, followed by that from sinkOrg.
>
> Does Lucene's core guarantee that a field's token stream is read
> completely before the next field's token stream is read, in the order
> the Fields are add()'ed?

Yes, the indexer processes all of one Field first, then the next one,
in the order the Fields were added.  If it doesn't, then we have a bug,
IMO, and we will need a different approach for the Tee/Sink.
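
To illustrate why the read order matters, here is a minimal plain-Java
sketch. It does not use Lucene's classes: Tee, Sink, consume, and the
sample tokens are hypothetical stand-ins that only mimic the idea of a
tee copying tokens into a sink as it is consumed. Assuming that model,
the sink holds tokens only after the tee has been read to the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TeeSinkOrderDemo {

    /** Hypothetical stand-in for a sink: buffers tokens the tee passes to it. */
    static final class Sink {
        private final List<String> buffered = new ArrayList<>();

        void add(String token) {
            buffered.add(token);
        }

        /** Returns a snapshot of whatever has been buffered so far. */
        List<String> snapshot() {
            return new ArrayList<>(buffered);
        }
    }

    /** Hypothetical stand-in for a tee: forwards each token and copies it to the sink. */
    static final class Tee implements Iterator<String> {
        private final Iterator<String> source;
        private final Sink sink;

        Tee(Iterator<String> source, Sink sink) {
            this.source = source;
            this.sink = sink;
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public String next() {
            String token = source.next();
            sink.add(token); // the copy happens only when the tee is actually read
            return token;
        }
    }

    /** Reads a stream to the end, as an indexer would for one field. */
    static List<String> consume(Iterator<String> stream) {
        List<String> out = new ArrayList<>();
        while (stream.hasNext()) {
            out.add(stream.next());
        }
        return out;
    }

    /** Assumed sample tokens for the body field. */
    static Iterator<String> bodyTokens() {
        return Arrays.asList("Grant", "works", "at", "Apache").iterator();
    }

    /** Body field (tee) is read completely first, then the sink-backed field. */
    static List<String> teeReadFirst() {
        Sink sink = new Sink();
        consume(new Tee(bodyTokens(), sink));
        return sink.snapshot();
    }

    /** Sink-backed field is read before the tee has been consumed. */
    static List<String> sinkReadFirst() {
        Sink sink = new Sink();
        Tee tee = new Tee(bodyTokens(), sink);
        List<String> seenBySinkField = sink.snapshot(); // nothing has been teed yet
        consume(tee);
        return seenBySinkField;
    }

    public static void main(String[] args) {
        System.out.println("tee read first:  " + teeReadFirst());
        System.out.println("sink read first: " + sinkReadFirst());
    }
}
```

In this sketch the sink-backed field sees all four tokens when the tee
is read first and none when it is read before the tee, which is why the
approach relies on Fields being processed one at a time in add() order.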

-Grant
