lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: Analyzer and Fieldable, different stored and indexed values
Date Wed, 27 Aug 2008 15:15:06 GMT
If I'm understanding correctly...

What about a SinkTokenizer that is backed by a Reader/Field instead of  
the current one that stores it all in a List?  This is more or less  
the use case for the Tee/Sink implementations, w/ the exception that  
we didn't plan for the Sink being too large, but that is easily  
overcome, IMO.

That is, you use a TeeTokenFilter that adds to your Sink, which  
serializes to some storage, and then your SinkTokenizer just  
deserializes.  No need to change Fieldable or anything else.
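
A rough, self-contained sketch of that idea follows. It deliberately does not use the Lucene classes: TokenSource, TeeToStorage and StorageBackedSink are hypothetical stand-ins for TokenStream, TeeTokenFilter and SinkTokenizer, tokens are reduced to plain strings, and an in-memory byte array stands in for whatever storage the Sink would really spill to.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins only; none of these types are Lucene classes.
final class SpillingSinkSketch {

    /** Minimal token source; next() returns null once exhausted. */
    interface TokenSource {
        String next() throws IOException;
    }

    /** Passes tokens through unchanged while serializing a copy to storage. */
    static final class TeeToStorage implements TokenSource {
        private final TokenSource input;
        private final DataOutputStream out;

        TeeToStorage(TokenSource input, OutputStream storage) {
            this.input = input;
            this.out = new DataOutputStream(storage);
        }

        @Override
        public String next() throws IOException {
            String token = input.next();
            if (token == null) {
                out.flush(); // end of stream: make sure everything reached storage
                return null;
            }
            out.writeUTF(token);
            return token;
        }
    }

    /** Replays tokens by deserializing them one at a time, never holding a List. */
    static final class StorageBackedSink implements TokenSource {
        private final DataInputStream in;

        StorageBackedSink(InputStream storage) {
            this.in = new DataInputStream(storage);
        }

        @Override
        public String next() throws IOException {
            try {
                return in.readUTF();
            } catch (EOFException endOfStorage) {
                return null; // nothing left to replay
            }
        }
    }

    static TokenSource fromList(List<String> tokens) {
        Iterator<String> it = tokens.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    static List<String> drain(TokenSource source) throws IOException {
        List<String> result = new ArrayList<>();
        for (String t = source.next(); t != null; t = source.next()) {
            result.add(t);
        }
        return result;
    }

    /** Runs the tokens through the tee, then replays them from storage. */
    static List<String> roundTrip(List<String> tokens) {
        try {
            ByteArrayOutputStream storage = new ByteArrayOutputStream(); // could be a file
            drain(new TeeToStorage(fromList(tokens), storage));
            return drain(new StorageBackedSink(
                    new ByteArrayInputStream(storage.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("quick", "brown", "fox")));
    }
}
```

A real version would also have to serialize offsets, types and position increments, not just the term text.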

Or maybe just a Tokenizer that is backed by a Field would work and  
uses a TermEnum on the Field to serve up next() for the TokenStream.

Just thinking out loud...


On Aug 27, 2008, at 10:47 AM, Andrzej Bialecki wrote:

> Hi all,
> I recently had a situation where I had to pass some metadata  
> information to Analyzer. This metadata was specific to a Document  
> instance (short story is that the analysis of some fields depended  
> on data coming from other fields, and the number of possible values  
> was too big to use separate fields for each combination).
> It would be nice to have an Analyzer.tokenStream(String fieldName,  
> Field f), or even better tokenStream(String fieldName, Document  
> doc) ... but probably it's too intrusive to change this. Although I  
> would be happy to have tokenStream(String, Fieldable), because then  
> I could provide my own Fieldable with metadata.
> In the meantime, having neither option, I came up with an idea: I  
> will use a subclass of Reader, and attach my metadata there, and  
> then use this Reader when creating a Field. However, I quickly  
> discovered that if you set a Reader on a Field, this field  
> automatically becomes un-stored - not what I wanted ... Field is  
> declared final, so no luck there.
> In the end I implemented a Fieldable, which sort of breaks the  
> contract for Fieldable - but it works :) . Namely, my Fieldable  
> returns both readerValue() and stringValue(). The first method  
> returns my subclass of Reader with metadata, and the second returns  
> the value to be stored.
> The reason why it works is that DocInverterPerField first checks the  
> tokenStreamValue, then the readerValue, and only then the  
> stringValue that it converts to a Reader - so in my case it uses the  
> supplied readerValue. At the same time, FieldsWriter, which is  
> responsible for storing field values, uses just the stringValue (or  
> binaryValue, but that wasn't relevant to my case), which is also set  
> to non-null.
> So, here are my thoughts on this, and I'd appreciate any comments on  
> this:
> * is this a justified use of the API? it works, at least at the  
> moment ;) and I couldn't find any other way to accomplish this task.
> * could we perhaps relax the restriction on Fieldable so that it can  
> return non-null values from more than one method, and clearly  
> document in what sequence they are processed? This is already hinted  
> at in the javadoc.
> * I propose to add a new API to Analyzer:
>  public TokenStream tokenStream(String fieldName, Fieldable field);
> to support use cases like the one I described above. The default  
> implementation could be something like this:
>  public TokenStream tokenStream(String fieldName, Fieldable field) {
> 	// Prefer the Reader if the field supplies one.
> 	Reader r = field.readerValue();
> 	if (r == null) {
> 		// Otherwise fall back to wrapping the string value.
> 		String s = field.stringValue();
> 		r = new StringReader(s);
> 	}
> 	return tokenStream(fieldName, r);
>  }
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>  Contact: info at sigram dot com
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
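
For readers following along, the workaround described in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions: FieldLike is a hypothetical, trimmed-down stand-in for Fieldable (it omits tokenStreamValue and binaryValue), MetadataReader and DualValueField are invented names, and readerForAnalysis / valueToStore merely mimic the lookup order the message attributes to DocInverterPerField and FieldsWriter.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-ins only; none of these types are Lucene classes.
final class MetadataFieldSketch {

    /** A Reader that carries per-document metadata alongside the text to analyze. */
    static final class MetadataReader extends StringReader {
        private final Map<String, String> metadata;

        MetadataReader(String text, Map<String, String> metadata) {
            super(text);
            this.metadata = metadata;
        }

        Map<String, String> metadata() {
            return metadata;
        }
    }

    /** Trimmed-down stand-in for the two Fieldable accessors discussed. */
    interface FieldLike {
        Reader readerValue();
        String stringValue();
    }

    /** Returns non-null values from both accessors, as in the workaround. */
    static final class DualValueField implements FieldLike {
        private final Reader reader;
        private final String stored;

        DualValueField(String textToAnalyze, String textToStore,
                       Map<String, String> metadata) {
            this.reader = new MetadataReader(textToAnalyze, metadata);
            this.stored = textToStore;
        }

        @Override public Reader readerValue() { return reader; }
        @Override public String stringValue() { return stored; }
    }

    /** Indexing side: the reader wins; the string is only a fallback. */
    static Reader readerForAnalysis(FieldLike field) {
        Reader r = field.readerValue();
        if (r != null) {
            return r;
        }
        String s = field.stringValue();
        return s == null ? null : new StringReader(s);
    }

    /** Storing side: only the string value is consulted. */
    static String valueToStore(FieldLike field) {
        return field.stringValue();
    }

    /** Analyzer side: recover the metadata if the Reader carries any. */
    static String metadataFor(Reader reader, String key) {
        if (reader instanceof MetadataReader) {
            return ((MetadataReader) reader).metadata().get(key);
        }
        return null;
    }

    public static void main(String[] args) {
        FieldLike field = new DualValueField(
                "text to analyze", "text to store",
                Collections.singletonMap("lang", "pl"));
        System.out.println(metadataFor(readerForAnalysis(field), "lang"));
        System.out.println(valueToStore(field));
    }
}
```

The sketch depends on the same ordering the message relies on, so it would break if the indexing side ever consulted the string value before the reader.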
