lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: Faster highlighting with TermPositionVectors
Date Thu, 04 Nov 2004 10:13:48 GMT

This is great stuff!

One quick comment, just from looking at the code (I haven't tried it yet):
shouldn't the tpv variable be used in this method?

     public static TokenStream getAnyTokenStream(IndexReader reader, int docId, String field, Analyzer analyzer) throws IOException
		TokenStream ts=null;

		TermFreqVector tfv=(TermFreqVector) 
		    if(tfv instanceof TermPositionVector)
		        //the most efficient choice..
 >>>		        TermPositionVector tpv=(TermPositionVector) 
		//No token info stored so fall back to analyzing raw content
		return ts;


On Oct 28, 2004, at 7:16 PM, Mark wrote:

> Thanks to the recent changes (see CVS) in TermFreqVector support we 
> can now make use of term offset information held
> in the Lucene index rather than incurring the cost of re-analyzing 
> text to highlight it.
> I have created a class (see 
> ) which handles creating
> a TokenStream from the TermPositionVector stored in the index, which 
> can then be passed to the highlighter.
> This approach is significantly faster than re-parsing the original 
> text.
> If people are happy with this class I'll add it to the Highlighter 
> sandbox but it may sit better elsewhere in the Lucene code base
> as a more general purpose utility.
> BTW, as part of putting this together I found that the TermFreq code 
> throws a NullPointerException when indexing fields
> that produce no tokens (i.e., empty or all stopwords). Otherwise things 
> work very well.
> Cheers
> Mark
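
For readers who want to see the idea in the quoted message in isolation (reuse stored term offsets for highlighting rather than re-analyzing the text), here is a minimal sketch in plain Java. It does not use Lucene's API and is not the class Mark wrote: StoredOffsetHighlighter, Token, tokensFromStoredOffsets and highlight are hypothetical names, and the map from term to offset pairs is only an assumed stand-in for what a term vector with offsets might hold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types are part of Lucene.
final class StoredOffsetHighlighter {

    /** One occurrence of a term, with [start, end) character offsets into the original text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Flattens per-term offset lists (assumed shape: term -> array of {start, end})
     * into one list ordered by start offset, which is the order a token stream
     * over the original text would produce.
     */
    static List<Token> tokensFromStoredOffsets(Map<String, int[][]> offsetsByTerm) {
        List<Token> tokens = new ArrayList<>();
        for (Map.Entry<String, int[][]> entry : offsetsByTerm.entrySet()) {
            for (int[] span : entry.getValue()) {
                tokens.add(new Token(entry.getKey(), span[0], span[1]));
            }
        }
        tokens.sort(Comparator.comparingInt(t -> t.start));
        return tokens;
    }

    /** Wraps each occurrence of queryTerm in <B>...</B>, using only the stored offsets. */
    static String highlight(String text, List<Token> tokens, String queryTerm) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokens) {
            if (!t.term.equals(queryTerm)) {
                continue;
            }
            out.append(text, last, t.start);               // untouched text before the hit
            out.append("<B>").append(text, t.start, t.end).append("</B>");
            last = t.end;
        }
        out.append(text, last, text.length());             // remainder after the last hit
        return out.toString();
    }
}
```

The point of the sketch is that no analyzer runs at highlight time: the only inputs are the stored text and the offsets recorded at index time. In real code, a fallback to re-analysis would still be needed for fields that were indexed without offset information.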
