I have been looking at implementing highlighting of the query terms in the documents returned by Lucene. I'd rather not have to retokenize each document on the fly in order to locate the terms, since this is slow and wasteful: Lucene already has the term-location information. (At least, Lucene stores the token positions of each term in the document, and these can be turned into character offsets provided you store the mapping from token positions to character offsets somewhere else, e.g. as an unindexed field.)

Looking under the hood, it seems from the source that in order to extract the term-location information for a specific document one would need to scan the ".prx" file sequentially, starting at the offset in the file of the term, until the document number is found. This probably wouldn't be necessary for a phrase query, since in that case the ".prx" file is already being scanned, and so one could just save a pointer to the start of the location information for each term in the phrase for each hit. However, for boolean queries it is the ".frq" file that is scanned, not the ".prx" file, so there isn't anywhere to get the location information without rescanning the ".prx" file after finding all the hits.

So, my question(s):

- have I missed something obvious, and in fact there is a simple way to extract term-location information for a specific document from the Lucene index?

- if not, would it be horribly slow to try and do it post-facto, after hits have been found, by scanning through the ".prx" file from the start of the information for each term in the query?

- if the answer to the second question is "yes - horribly slow", would it make sense then to add an extra field to each entry in the ".frq" file indicating where the location information for that term and document is in the ".prx" file (i.e., the ".frq" file info for each term would consist of a series of triples, where prx_pointer_offset gives the number of bytes to skip in the ".prx" file to get to the location information for the specified document)? The prx_pointer_offset could then be used in a boolean query to compute pointers for each hit indicating where in the ".prx" file the location information for each term starts.

Thanks,

Jonathan

--
Jonathan Baxter
jbaxter@panscient.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
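
To make the third question concrete, here is a rough sketch in Python of the layout I have in mind. It is a toy in-memory model, not Lucene's actual file format or API: the function names, the (doc_delta, freq, prx_pointer_offset) triple layout and the 7-bits-per-byte variable-length integer encoding are all assumptions made for illustration. It compares the post-facto scan of the ".prx" data with a seek computed from the accumulated prx_pointer_offset values.

```python
def write_vint(value):
    """Encode a non-negative int as variable-length bytes (7 data bits per byte)."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def read_vint(buf, pos):
    """Decode one variable-length int from buf at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7


def build_postings(doc_positions):
    """Build toy postings for ONE term from {doc_id: [ascending token positions]}.

    Returns (frq, prx). frq is a list of hypothetical
    (doc_delta, freq, prx_pointer_offset) triples; prx holds the
    delta-encoded positions of all documents, back to back.
    """
    frq, prx = [], bytearray()
    last_doc, last_start = 0, 0
    for doc in sorted(doc_positions):
        start = len(prx)  # byte offset where this document's positions begin
        last_pos = 0
        for p in doc_positions[doc]:
            prx += write_vint(p - last_pos)
            last_pos = p
        frq.append((doc - last_doc, len(doc_positions[doc]), start - last_start))
        last_doc, last_start = doc, start
    return frq, bytes(prx)


def read_positions(prx, pos, freq):
    """Decode freq delta-encoded positions starting at byte offset pos."""
    positions, last = [], 0
    for _ in range(freq):
        delta, pos = read_vint(prx, pos)
        last += delta
        positions.append(last)
    return positions, pos


def positions_by_scanning(frq, prx, target_doc):
    """Post-facto approach: decode every earlier document's positions in prx."""
    doc, pos = 0, 0
    for doc_delta, freq, _ in frq:
        doc += doc_delta
        positions, pos = read_positions(prx, pos, freq)
        if doc == target_doc:
            return positions
    return None


def positions_by_pointer(frq, prx, target_doc):
    """Proposed approach: sum prx_pointer_offset during the frq scan, then seek."""
    doc, pointer = 0, 0
    for doc_delta, freq, prx_pointer_offset in frq:
        doc += doc_delta
        pointer += prx_pointer_offset
        if doc == target_doc:
            return read_positions(prx, pointer, freq)[0]
    return None


# Toy data: one term occurring in documents 3, 7 and 42.
frq, prx = build_postings({3: [1, 200, 450], 7: [12], 42: [5, 300]})
hit_positions = positions_by_pointer(frq, prx, 42)
```

In this model the boolean-query scan already walks the ".frq" triples, so the running sum of prx_pointer_offset gives a pointer for each hit without decoding any position data for the documents that precede it.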