lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Beard, Brian" <>
Subject RE: Poor QPS with highlighting
Date Thu, 05 Feb 2009 14:58:12 GMT
A while ago someone posted a link to a project called XTF which does

The one problem with this approach still lurking for me (or maybe I
don't understand how to get around) is how to handle multiple terms
which "must" appear in the query, but are in non-overlapping chunks. You
would want a hit, but the hit won't come back because both terms don't
appear in the same document.

Has anyone given any thought to this?  

-----Original Message-----
From: markharw00d [] 
Sent: Tuesday, February 03, 2009 3:53 PM
Subject: Re: Poor QPS with highlighting

> Can you describe this in a little more detail; I'm not exactly sure
what you
> mean.

Break your large text documents into multiple Lucene documents. Rather 
than dividing them up into entirely discreet chunks of text consider 
storing/indexing *overlapping* sections of text with an overlap as big 
as the largest "slop" factor you use on Phrase/Span queries so that you 
don't cut any potential phrases in half and fail to match e.g.

This non-overlapping indexing scheme will not match a search for "George


    Doc 1 = "....  outgoing president George "
    Doc 2=  "Bush stated that ..."

While this overlapping scheme will match...
    Doc 1 = "....  outgoing president George "
    Doc 2=  "president George Bush stated that ..."

This fragmenting approach helps avoid the performance cost of 
highlighting very large documents.

The remaining issue is to remove duplicates in your search results when 
you match multiple chunks e.g. Lucene Docs #1 and #2 both refer to Input

Doc#1 and will match a search for "president". You will need to store a 
field for the "original document number" and remove any duplicates (or 
merge them in the display if that is what is required).


To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message