lucene-dev mailing list archives

From Mark Miller
Subject Re: BufferingAnalyzer (or something like that)
Date Thu, 08 Nov 2007 12:14:15 GMT
I think it is certainly useful as I use something similar myself. My 
implementation is not as generic as I would like (requires a specific 
special analyzer written for the task), but works great for my case. I 
use a CachingTokenFilter as well as a couple of ThreadLocals so that I can 
have a stemmed and a non-stemmed index without having to analyze twice. It 
saves me plenty in my benchmarks. A generic solution would be awesome.
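
A rough sketch of the analyze-once idea, in plain Java with no Lucene dependency. Everything in it is a hypothetical stand-in and not the implementation described above: the class and method names are made up, the "analysis" is just lowercasing and splitting, and the stemmer is a toy that strips a trailing "s". The point is only that the second token list is built from a per-thread cache instead of a second pass over the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

public class AnalyzeOnceSketch {

    // Counts full analysis passes, only to show that the text is analyzed once.
    static final AtomicInteger ANALYSIS_RUNS = new AtomicInteger();

    // Per-thread cache of the most recent unstemmed tokens (hypothetical).
    private static final ThreadLocal<List<String>> CACHED = new ThreadLocal<>();

    /** Runs the "expensive" analysis (here: lowercase and split) and caches the result. */
    static List<String> analyzeUnstemmed(String text) {
        ANALYSIS_RUNS.incrementAndGet();
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        CACHED.set(tokens);
        return new ArrayList<>(tokens);
    }

    /** Builds the stemmed tokens from the cache instead of re-analyzing the text. */
    static List<String> stemmedFromCache() {
        List<String> cached = CACHED.get();
        if (cached == null) {
            throw new IllegalStateException("analyzeUnstemmed must run first on this thread");
        }
        List<String> stemmed = new ArrayList<>();
        for (String token : cached) {
            stemmed.add(toyStem(token));
        }
        return stemmed;
    }

    /** Toy stemmer: strips one trailing "s" from longer tokens. Not a real stemming algorithm. */
    static String toyStem(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    public static void main(String[] args) {
        System.out.println(analyzeUnstemmed("The filters cache tokens")); // [the, filters, cache, tokens]
        System.out.println(stemmedFromCache());                           // [the, filter, cache, token]
        System.out.println(ANALYSIS_RUNS.get());                          // 1
    }
}
```

The ThreadLocal keeps concurrent indexing threads from seeing each other's tokens, at the cost of requiring both fields of a document to be produced on the same thread.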

- Mark

Grant Ingersoll wrote:
> From time to time, I have run across analysis problems where I want to 
> only analyze a particular field once, but I also want to "pluck" 
> certain tokens (one or more) out of the stream and then use them as 
> the basis for another field.  For example, say I have a token filter 
> that can identify proper names and I also want a field that contains 
> all the tokens.  Currently, the way to do this is to analyze your 
> content for the whole field and then reanalyze the field for the 
> proper names.  Essentially do what Solr's copyField does.  Another use 
> case, potentially, is when there are two fields, one that is 
> lowercased and one that isn't.  In this case, you could do all the 
> analysis, then have the last filter set aside the tokens before they 
> are lower-cased (or vice versa) and then when it comes to indexing the 
> lower-cased field, Lucene just needs to spit back out the token buffer.
>
> This has always struck me as wasteful especially given a complex 
> analysis stream.  What I am thinking of doing is injecting a 
> TokenFilter that can buffer these tokens and then that TokenFilter can 
> be shared by the Analyzer when it is time to analyze the other field.  
> Obviously, there are memory issues that need to be managed/documented, 
> but I think they could be controlled by the application.  For example, 
> there likely aren't enough proper nouns in a given document to create 
> a huge memory footprint.  Unless the "filtering" 
> TokenFilter is also expanding and adding other tokens, I would guess 
> most use cases in the worst case would use up as much memory as the 
> original field analysis.  At any rate, some filter implementations 
> could be designed to control memory and discard when full or something 
> like that.
>
> The CachingTokenFilter kind of does this, but it doesn't allow for 
> modifications and always gives you those same tokens back.  It also 
> seems like the new Field.tokenStreamValue() and TokenStream based 
> constructor might help, but you have the whole construction problem.  
> I suppose you could "pre-analyze" the content and then make both 
> Fields based on that pre-analysis.
>
> I currently have two different approaches to this.  The first is a 
> CachedAnalyzer and CachedTokenizer implementation that takes in a List 
> of tokens.  The other is an abstract Analyzer that coordinates the 
> handoff of the buffer created by the first TokenStream and gives it to 
> the second.  The first requires that you do the looping on the 
> TokenStream in the application outside of Lucene; the latter lets 
> Lucene do what it normally does.
>
> Anyone have any thoughts on this?  Is this useful (i.e. should I add 
> it in)?
> -Grant
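
For illustration, the buffering filter proposed in the quoted message might look roughly like the sketch below. It is plain Java with no Lucene dependency, and every name in it (TokenSource, BufferingFilter, replay, drain) is hypothetical, not part of any Lucene API. A capitalization check stands in for a real proper-name detector, and the size cap is one possible take on the "discard when full" idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

public class BufferingSketch {

    /** Hypothetical stand-in for a token stream: returns the next token, or null at the end. */
    interface TokenSource {
        String next();
    }

    /** Wraps a list of tokens as a TokenSource. */
    static TokenSource fromList(List<String> tokens) {
        Iterator<String> it = tokens.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Passes every token through unchanged and sets aside the ones matching a predicate. */
    static final class BufferingFilter implements TokenSource {
        private final TokenSource input;
        private final Predicate<String> pluck;
        private final int maxBuffered;
        private final List<String> buffer = new ArrayList<>();

        BufferingFilter(TokenSource input, Predicate<String> pluck, int maxBuffered) {
            this.input = input;
            this.pluck = pluck;
            this.maxBuffered = maxBuffered;
        }

        @Override
        public String next() {
            String token = input.next();
            // Once the cap is reached, further matches are passed through but not buffered.
            if (token != null && pluck.test(token) && buffer.size() < maxBuffered) {
                buffer.add(token);
            }
            return token;
        }

        /** Replays the tokens set aside so far, e.g. as the stream for a second field. */
        TokenSource replay() {
            return fromList(new ArrayList<>(buffer));
        }
    }

    /** Consumes a source to the end and collects its tokens. */
    static List<String> drain(TokenSource source) {
        List<String> out = new ArrayList<>();
        for (String t = source.next(); t != null; t = source.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> analyzed = Arrays.asList("Grant", "asked", "Mark", "about", "Lucene");
        // Toy rule: treat capitalized tokens as proper names.
        Predicate<String> looksLikeName = t -> !t.isEmpty() && Character.isUpperCase(t.charAt(0));
        BufferingFilter filter = new BufferingFilter(fromList(analyzed), looksLikeName, 100);

        System.out.println(drain(filter));          // [Grant, asked, Mark, about, Lucene]
        System.out.println(drain(filter.replay())); // [Grant, Mark, Lucene]
    }
}
```

The first field has to be consumed completely before the buffer is replayed, which is the handoff the abstract Analyzer in the second approach would need to coordinate.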
