lucene-java-user mailing list archives

From Martijn <>
Subject Re: retrieve tokens
Date Wed, 22 Dec 2004 20:44:18 GMT
Erik Hatcher wrote:

> On Dec 22, 2004, at 12:43 PM, M. Smit wrote:
> Consider that you're only highlighting 20 or so entries at one time.  
> Getting the text from a Lucene index you're already navigating will be 
> quite quick.  But it shouldn't be too bad to pull 20 records from a 
> database either.
> There is one other consideration, and that is to use the new (CVS 
> only) feature of capturing term vectors with position information.  
> The author of the Highlighter, Mark Harwood, has posted in the not too 
> distant past, an update to the Highlighter that can use this position 
> information for highlighting rather than re-analyzing the original 
> text.  The re-analysis of the text may be the bottleneck, not the 
> database access.
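
To make the suggestion above concrete, here is a minimal sketch of
highlighting from stored character offsets instead of re-analyzing the
text. It is plain Java and does not use the Lucene API: the class
OffsetHighlighter, the Span type and the snippet method are hypothetical
names for illustration, and the hit offsets are assumed to come from
something like a term vector stored with position/offset information.

```java
import java.util.List;

// Hypothetical illustration only; not Lucene's Highlighter and not its API.
public class OffsetHighlighter {

    /** Character span [start, end) of one matched term in the stored text. */
    public static final class Span {
        public final int start;
        public final int end;

        public Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Builds a snippet around the first hit, wrapping every hit that falls
     * inside the snippet window in <b>...</b>. The text is never tokenized:
     * only the precomputed offsets are used.
     * Assumes hits are sorted by start offset and do not overlap.
     */
    public static String snippet(String text, List<Span> hits, int context) {
        if (hits.isEmpty()) {
            // No hits: fall back to the beginning of the text.
            return text.substring(0, Math.min(text.length(), 2 * context));
        }
        Span first = hits.get(0);
        int from = Math.max(0, first.start - context);
        int to = Math.min(text.length(), first.end + context);

        StringBuilder sb = new StringBuilder();
        int pos = from;
        for (Span hit : hits) {
            if (hit.start < pos || hit.end > to) {
                continue; // outside the snippet window
            }
            sb.append(text, pos, hit.start);          // text before the hit
            sb.append("<b>").append(text, hit.start, hit.end).append("</b>");
            pos = hit.end;
        }
        sb.append(text, pos, to);                     // trailing context
        return sb.toString();
    }
}
```

The cost of snippet() depends on the window size and the number of
hits, not on the length of the document, which is the point of using
stored offsets for a 200-page text.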

So building the snippet would indeed have to traverse the 200 pages of 
plain text per hit. I don't think that pulling the records from the 
database would be the issue. But I'm somewhat intrigued by the option 
of leaving out the database altogether (for the presentation of the 
hits, that is) and storing this plain-text representation of the PDF in 
Lucene. My worry: how does size (the extra space I would need to store 
the plain text) affect Lucene operations? Is the order of  N.N or like

31.69 nHz = once a year
