lucene-dev mailing list archives

From Jonathan Baxter <>
Subject Term highlighting
Date Wed, 14 May 2003 12:17:13 GMT
I have been looking at implementing highlighting of the query terms in 
the documents returned by Lucene. I'd rather not have to retokenize the 
document on the fly in order to locate the terms, since this is slow 
and wasteful: Lucene already has the term-location information. (At 
least, Lucene stores the token positions of each term in the document, 
and these can be turned into character offsets provided you store the 
mapping from token positions to character offsets somewhere else, e.g. 
as an unindexed field.) 
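
To make the position-to-offset idea concrete, here is a minimal sketch 
of such a mapping. Everything in it is hypothetical (OffsetMap, encode, 
lookup and highlight are illustrative names, not Lucene API), and it 
assumes a plain whitespace tokenizer whose token positions agree with 
the ones the indexing analyzer produced. The encoded string is what 
would be stored in the unindexed field.

```java
/** Hypothetical helper, not part of Lucene. */
final class OffsetMap {

    /** Encodes each whitespace-delimited token as "start:end", comma separated. */
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        int i = 0, n = text.length();
        while (i < n) {
            // Skip whitespace between tokens.
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            if (i >= n) break;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (sb.length() > 0) sb.append(',');
            sb.append(start).append(':').append(i);
        }
        return sb.toString();
    }

    /** Returns {startOffset, endOffset} for the token at the given position. */
    static int[] lookup(String encoded, int position) {
        String[] pair = encoded.split(",")[position].split(":");
        return new int[] { Integer.parseInt(pair[0]), Integer.parseInt(pair[1]) };
    }

    /** Wraps the tokens at the given (ascending) positions in <b>...</b>. */
    static String highlight(String text, String encoded, int[] positions) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int p : positions) {
            int[] span = lookup(encoded, p);
            out.append(text, last, span[0])
               .append("<b>").append(text, span[0], span[1]).append("</b>");
            last = span[1];
        }
        return out.append(text.substring(last)).toString();
    }
}
```

The token positions passed to highlight would come from the index; the 
rest of this message is about how to get them cheaply.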

Looking under the hood, it seems from the source that in order to 
extract the term location information for a specific document one 
would need to scan the ".prx" file sequentially starting at the 
offset in the file of the term, until the document number is found. 
This probably wouldn't be necessary for a phrase query, since in that 
case the .prx file is already being scanned, and so one could just 
save a pointer to the start of the location information for each term 
in the phrase for each hit. 
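
The sequential scan described above can be sketched as follows. This is 
a simplified in-memory model, not the real on-disk format: the arrays 
docs and freqs stand in for one term's ".frq" entries, prxDeltas stands 
in for that term's delta-coded positions in ".prx", and PrxScan is a 
hypothetical name.

```java
/** Hypothetical model of scanning one term's postings; not Lucene code. */
final class PrxScan {

    /**
     * Returns the absolute token positions of the term in targetDoc,
     * or an empty array if the term does not occur in that document.
     * docs must be sorted ascending; freqs[i] is the number of
     * position deltas that docs[i] contributes to prxDeltas.
     */
    static int[] positionsFor(int[] docs, int[] freqs, int[] prxDeltas, int targetDoc) {
        int prx = 0; // cursor into the flat position stream
        for (int i = 0; i < docs.length && docs[i] <= targetDoc; i++) {
            if (docs[i] == targetDoc) {
                int[] out = new int[freqs[i]];
                int pos = 0;
                for (int k = 0; k < freqs[i]; k++) {
                    pos += prxDeltas[prx + k]; // undo the delta coding
                    out[k] = pos;
                }
                return out;
            }
            prx += freqs[i]; // skip over an earlier document's positions
        }
        return new int[0];
    }
}
```

The cost is the concern: reaching a document means stepping over the 
positions of every earlier document that contains the term.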

However, for boolean queries it is the ".frq" file that is scanned, 
not the ".prx" file, so there isn't anywhere to get the location 
information without rescanning the ".prx" file after finding all the 
hits.

So, my question(s):

- have I missed something obvious, and is there in fact a simple way to 
extract term-location information for a specific document from the 
Lucene index?

- if not, would it be horribly slow to try and do it post-facto after 
hits have been found by scanning through the ".prx" file from the 
start of the information for each term in the query?

- if the answer to the second question is "yes - horribly slow", would 
it make sense then to add an extra field to each entry in the ".frq" 
file indicating where the location information for the term and 
document is in the ".prx" file? That is, the .frq file info for each 
term would consist of a series of <doc_num, freq, prx_pointer_offset> 
triples, where prx_pointer_offset gives the number of bytes to skip in 
the .prx file to get to the location information for the specified 
document. The prx_pointer_offset could then be used in a boolean 
query to compute, for each hit, a pointer to where in the .prx file 
the location information for each term starts. 
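
A rough sketch of the proposed <doc_num, freq, prx_pointer_offset> 
triples, to show what the extra field would buy. This is an 
illustration under assumptions, not a patch: FrqWithPrxPointer and 
Entry are hypothetical names, positions are written as delta-coded 
variable-length ints into an in-memory byte stream standing in for 
".prx", and prx_pointer_offset is taken to be the byte offset from the 
start of the term's position data.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one term's postings; not Lucene code. */
final class FrqWithPrxPointer {

    /** One proposed .frq entry: <doc_num, freq, prx_pointer_offset>. */
    static final class Entry {
        final int doc, freq, prxOffset;
        Entry(int doc, int freq, int prxOffset) {
            this.doc = doc; this.freq = freq; this.prxOffset = prxOffset;
        }
    }

    final List<Entry> frq = new ArrayList<>();
    final ByteArrayOutputStream prx = new ByteArrayOutputStream();

    /** Appends a document (in ascending doc order) with its sorted positions. */
    void add(int doc, int[] positions) {
        frq.add(new Entry(doc, positions.length, prx.size()));
        int prev = 0;
        for (int p : positions) { writeVInt(p - prev); prev = p; }
    }

    /** Writes 7 bits per byte, low bits first; high bit marks continuation. */
    private void writeVInt(int v) {
        while ((v & ~0x7F) != 0) { prx.write((v & 0x7F) | 0x80); v >>>= 7; }
        prx.write(v);
    }

    /** Jumps straight to the entry's position data; no earlier docs are read. */
    int[] positions(Entry e) {
        byte[] bytes = prx.toByteArray();
        int at = e.prxOffset;
        int[] out = new int[e.freq];
        int pos = 0;
        for (int k = 0; k < e.freq; k++) {
            int b = bytes[at++], v = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[at++];
                v |= (b & 0x7F) << shift;
            }
            pos += v; // undo the delta coding
            out[k] = pos;
        }
        return out;
    }
}
```

The trade-off is a larger .frq entry for every document of every term, 
in exchange for direct access to the positions of each hit.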



Jonathan Baxter
