lucene-dev mailing list archives

From Grant Ingersoll <>
Subject Re: BufferingAnalyzer (or something like that)
Date Mon, 19 Nov 2007 13:53:10 GMT

On Nov 8, 2007, at 7:14 AM, Mark Miller wrote:

> I think it is certainly useful as I use something similar myself. My  
> implementation is not as generic as I would like (it requires a  
> special analyzer written for the task), but it works great for  
> my case. I use a CachingTokenFilter as well as a couple ThreadLocals  
> so that I can have a stemmed and non stemmed index without having to  
> analyze twice. It saves me plenty in my benchmarks. A generic  
> solution would be awesome.
> - Mark
> Grant Ingersoll wrote:
>> From time to time, I have run across analysis problems where I want  
>> to only analyze a particular field once, but I also want to "pluck"  
>> certain tokens (one or more) out of the stream and then use them as  
>> the basis for another field.  For example, say I have a token  
>> filter that can identify proper names and I also want a field that  
>> contains all the tokens.  Currently, the way to do this is to  
>> analyze your content for the whole field and then reanalyze the  
>> field for the proper names.  Essentially do what Solr's copyField  
>> does.  Another use case, potentially, is when there are two fields,  
>> one that is lowercased and one that isn't.  In this case, you could  
>> do all the analysis, then have the last filter set aside the tokens  
>> before they are lower-cased (or vice versa) and then when it comes  
>> to indexing the lower-cased field, Lucene just needs to spit back  
>> out the token buffer.
>> This has always struck me as wasteful, especially given a complex  
>> analysis stream.  What I am thinking of doing is injecting a  
>> TokenFilter that can buffer these tokens and then that TokenFilter  
>> can be shared by the Analyzer when it is time to analyze the other  
>> field.  Obviously, there are memory issues that need to be  
>> managed/documented, but I think they could be controlled by the  
>> application.  For example, there likely aren't a lot of proper nouns  
>> in a given document, so the buffer would not be a huge memory footprint.  
>> Unless the "filtering" TokenFilter is also expanding and adding  
>> other tokens, I would guess that even the worst case would use up  
>> about as much memory as the original field analysis.  At any rate,  
>> some filter implementations could be designed to control memory and  
>> discard when full or something like that.
>> The CachingTokenFilter kind of does this, but it doesn't allow for  
>> modifications and always gives you those same tokens back.  It also  
>> seems like the new Field.tokenStreamValue() and TokenStream based  
>> constructor might help, but you have the whole construction  
>> problem.  I suppose you could "pre-analyze" the content and then  
>> make both Fields based on that pre-analysis.  
>> I currently have two different approaches to this.  The first is a  
>> CachedAnalyzer and CachedTokenizer implementation that takes in a  
>> List of tokens.  The other is an abstract Analyzer that coordinates  
>> the handoff of the buffer created by the first TokenStream and  
>> gives it to the second.  The first requires that you do the looping  
>> on the TokenStream in the application outside of Lucene; the second  
>> lets Lucene do what it normally does.  
>> Anyone have any thoughts on this?  Is this useful (i.e. should I  
>> add it in)?
>> -Grant
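The buffering TokenFilter Grant describes could look roughly like the following. This is a hypothetical illustration in plain Java rather than the Lucene TokenStream API (BufferingFilter, the predicate, and the method names are all invented here), just to show the shape of the idea: every token passes through to the first field unchanged, the ones a predicate matches (e.g. a proper-name detector) are also set aside, and the buffer later backs the second field.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not the Lucene API: tokens flow through to the
// first field unchanged, while those matching a predicate (e.g. a
// proper-name detector) are also set aside in a buffer that later
// backs a second field.
final class BufferingFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final Predicate<String> pluck;
    private final List<String> buffer = new ArrayList<>();

    BufferingFilter(Iterator<String> input, Predicate<String> pluck) {
        this.input = input;
        this.pluck = pluck;
    }

    @Override public boolean hasNext() { return input.hasNext(); }

    @Override public String next() {
        String token = input.next();
        if (pluck.test(token)) {
            buffer.add(token);  // set aside for the second field
        }
        return token;           // still emitted to the first field
    }

    // Once the first field is indexed, the buffer is "spit back out"
    // as the token source for the second field.
    List<String> buffered() { return buffer; }
}
```

The memory concern raised above lives entirely in the buffer; a bounded variant could cap its size and discard when full, as the message suggests.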
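The analyze-once, replay-twice trick Mark describes (a stemmed and a non-stemmed index from a single analysis pass) can be sketched in the same hypothetical style, with invented names and a toy suffix-stripper standing in for a real stemming filter:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: cache the tokens from one analysis pass, then
// replay them twice, once untouched for the non-stemmed field and once
// through a stemmer for the stemmed field, so the expensive analysis
// runs only once.
final class CachedTokens {
    private final List<String> tokens = new ArrayList<>();

    void add(String token) { tokens.add(token); }

    // Replay the buffer unchanged (non-stemmed field).
    Iterator<String> replay() { return tokens.iterator(); }

    // Replay the buffer through a stemmer (stemmed field).
    List<String> replayStemmed() {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(stem(t));
        return out;
    }

    private static String stem(String t) {
        // Toy suffix-stripping stand-in for a real stemming filter.
        return t.endsWith("ing") ? t.substring(0, t.length() - 3) : t;
    }
}
```

In Mark's actual setup the cache is a CachingTokenFilter and the coordination goes through ThreadLocals, so each indexing thread replays its own buffer.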

Grant Ingersoll

