lucene-dev mailing list archives

From Grant Ingersoll <>
Subject BufferingAnalyzer (or something like that)
Date Thu, 08 Nov 2007 01:08:53 GMT
From time to time, I have run across analysis problems where I want to analyze a particular field only once, but I also want to "pluck" certain tokens (one or more) out of the stream and then use them as the basis for another field. For example, say I have a token filter that can identify proper names, and I also want a field that contains all the tokens. Currently, the way to do this is to analyze your content for the whole field and then reanalyze it just for the proper names; essentially, do what Solr's copyField does. Another use case, potentially, is when there are two fields, one that is lowercased and one that isn't. In this case, you could do all the analysis, have the last filter set aside the tokens before they are lowercased (or vice versa), and then, when it comes time to index the lowercased field, Lucene just needs to spit the token buffer back out.
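To make that second use case concrete, here is a minimal self-contained sketch (token streams are modeled as Lists of strings, with made-up names; this is not the real Lucene API): one analysis pass produces the lowercased tokens for one field while setting the original tokens aside in a buffer for the other field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified model: one pass over the analyzed tokens buffers
// the pre-lowercase originals and emits the lowercased copies.
public class LowercaseBufferSketch {

    static List<String> buffer = new ArrayList<>(); // set aside for the other field

    static List<String> analyzeOnce(List<String> tokens) {
        buffer.clear();
        List<String> lowered = new ArrayList<>();
        for (String t : tokens) {
            buffer.add(t);                 // pre-lowercase copy for field B
            lowered.add(t.toLowerCase());  // what the lowercased field A sees
        }
        return lowered;
    }

    public static void main(String[] args) {
        System.out.println(analyzeOnce(List.of("Lucene", "Index")));
        System.out.println(buffer);
    }
}
```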

This has always struck me as wasteful, especially given a complex analysis stream. What I am thinking of doing is injecting a TokenFilter that can buffer these tokens; that TokenFilter can then be shared by the Analyzer when it is time to analyze the other field. Obviously, there are memory issues that need to be managed/documented, but I think they could be controlled by the application. For example, there likely aren't so many proper nouns in a given document that they would amount to a huge memory footprint. Unless the "filtering" TokenFilter is also expanding and adding other tokens, I would guess most use cases would, in the worst case, use as much memory as the original field analysis. At any rate, some filter implementations could be designed to control memory, discarding tokens when the buffer is full or something like that.
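A sketch of the buffering-filter idea, again with hypothetical simplified interfaces (a stream is just successive next() calls, not the real Lucene API): the filter forwards every token unchanged for the primary field while copying the ones matching a predicate into a buffer that the other field's stream can drain.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical simplified model of the injected buffering TokenFilter.
public class BufferingFilterSketch {

    interface TokenStream { String next(); } // returns null when exhausted

    static class BufferingFilter implements TokenStream {
        private final TokenStream input;
        private final Predicate<String> pluck;
        final List<String> buffer = new ArrayList<>();
        BufferingFilter(TokenStream input, Predicate<String> pluck) {
            this.input = input;
            this.pluck = pluck;
        }
        public String next() {
            String t = input.next();
            if (t != null && pluck.test(t)) {
                buffer.add(t); // set aside for the second field
            }
            return t;          // pass through untouched for the first field
        }
    }

    // Consume the whole stream (as indexing the primary field would), then
    // return the plucked tokens; capitalized words stand in for proper names.
    static List<String> pluckCapitalized(List<String> tokens) {
        Iterator<String> it = tokens.iterator();
        BufferingFilter f = new BufferingFilter(
            () -> it.hasNext() ? it.next() : null,
            t -> Character.isUpperCase(t.charAt(0)));
        while (f.next() != null) { /* primary field sees every token */ }
        return f.buffer;
    }

    public static void main(String[] args) {
        System.out.println(pluckCapitalized(
            List.of("met", "Grant", "on", "the", "Lucene", "list")));
    }
}
```

A real implementation would bound the buffer (or discard on overflow), as suggested above.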

The CachingTokenFilter kind of does this, but it doesn't allow for modifications and always gives you the same tokens back. It also seems like the new Field.tokenStreamValue() and the TokenStream-based Field constructor might help, but then you have the whole construction problem. I suppose you could "pre-analyze" the content and then base both Fields on that pre-analysis.

I currently have two different approaches to this. The first is a CachedAnalyzer and CachedTokenizer implementation that takes in a List of tokens. The other is an abstract Analyzer that coordinates the handoff of the buffer created by the first TokenStream and gives it to the second. The first requires that you do the looping over the TokenStream in the application, outside of Lucene; the latter lets Lucene do what it normally does.
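The first approach might look like this minimal self-contained sketch (hypothetical names, not the actual CachedAnalyzer/CachedTokenizer classes): the application drains the analysis chain once into a List, and a CachedTokenizer-style stream replays that List when the second field is indexed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical simplified model of the "loop in the application" approach.
public class CachedTokenizerSketch {

    // Stand-in for a Tokenizer backed by a pre-built List of tokens.
    static class CachedTokenizer {
        private final Iterator<String> it;
        CachedTokenizer(List<String> tokens) { this.it = tokens.iterator(); }
        String next() { return it.hasNext() ? it.next() : null; }
    }

    // Application-side loop: consume the analysis chain once, caching every
    // token so the second field never re-runs the chain.
    static List<String> drain(Iterator<String> analyzed) {
        List<String> cache = new ArrayList<>();
        while (analyzed.hasNext()) {
            cache.add(analyzed.next());
        }
        return cache;
    }

    public static void main(String[] args) {
        List<String> cache = drain(List.of("buffered", "tokens").iterator());
        CachedTokenizer replay = new CachedTokenizer(cache); // for the second field
        for (String t; (t = replay.next()) != null; ) {
            System.out.println(t);
        }
    }
}
```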

Anyone have any thoughts on this?  Is this useful (i.e. should I add  
it in)?

