lucene-java-user mailing list archives

From Michael Stoppelman
Subject Re: Poor QPS with highlighting
Date Wed, 04 Feb 2009 08:48:40 GMT
Thanks, Mark, for the explanation. I think your solution would definitely
change the tf-idf scoring for documents, since the field is now split up
over multiple docs. One option to get around the changed scoring would be
to run a completely separate index for highlighting (with the overlapping
docs you described). It still seems like storing the offsets would be the
most efficient solution since I wouldn't need a new service to do the


On Tue, Feb 3, 2009 at 12:52 PM, markharw00d wrote:

>> Can you describe this in a little more detail; I'm not exactly sure what
>> you mean.
> Break your large text documents into multiple Lucene documents. Rather than
> dividing them up into entirely discrete chunks of text, consider
> storing/indexing *overlapping* sections of text, with an overlap as big as
> the largest "slop" factor you use on Phrase/Span queries, so that you don't
> cut any potential phrase in half and fail to match it. For example,
> this non-overlapping indexing scheme will not match a search for "George
> Bush":
>   Doc 1 = "....  outgoing president George "
>   Doc 2 = "Bush stated that ..."
> While this overlapping scheme will match:
>   Doc 1 = "....  outgoing president George "
>   Doc 2 = "president George Bush stated that ..."
> This fragmenting approach helps avoid the performance cost of highlighting
> very large documents.
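
The overlapping scheme above can be sketched roughly as follows. This is a
plain-Python illustration, not Lucene code: the function names
`overlapping_chunks` and `contains_phrase`, the whitespace tokenization, and
the window sizes are all assumptions chosen to mirror the "George Bush"
example.

```python
def overlapping_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper for
    illustration; a real indexer would emit one Lucene document per window.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not tokens:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += step
    return chunks


def contains_phrase(chunk, phrase):
    """Return True if the tokens of `phrase` appear adjacently in `chunk`."""
    n = len(phrase)
    return any(chunk[i:i + n] == phrase for i in range(len(chunk) - n + 1))


tokens = "the outgoing president George Bush stated that".split()
phrase = ["George", "Bush"]

# Discrete chunks: the phrase is cut in half at the chunk boundary.
discrete = overlapping_chunks(tokens, chunk_size=4, overlap=0)
# Overlapping chunks: the second chunk repeats the tail of the first.
overlapped = overlapping_chunks(tokens, chunk_size=4, overlap=2)

print(any(contains_phrase(c, phrase) for c in discrete))
print(any(contains_phrase(c, phrase) for c in overlapped))
```

With these inputs the discrete split puts "George" and "Bush" in different
chunks, while the overlapping split keeps them together in one chunk.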
> The remaining issue is to remove duplicates from your search results when you
> match multiple chunks, e.g. Lucene docs #1 and #2 both refer to input doc #1
> and will both match a search for "president". You will need to store a field
> for the "original document number" and remove any duplicates (or merge them
> in the display if that is what is required).
> Cheers,
> Mark
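
The duplicate-removal step Mark describes could look something like the sketch
below. It is plain Python rather than Lucene code, and it assumes the hits
arrive already sorted best-first; the dictionary keys `chunk_id`, `orig_doc`,
and `score` are hypothetical names standing in for the stored "original
document number" field and whatever else the search returns.

```python
def dedupe_hits(hits):
    """Keep only the first (best-ranked) hit for each original document.

    `hits` is assumed to be ordered best-first; each hit is a dict with a
    hypothetical "orig_doc" key holding the original document number.
    """
    seen = set()
    unique = []
    for hit in hits:
        if hit["orig_doc"] not in seen:
            seen.add(hit["orig_doc"])
            unique.append(hit)
    return unique


# Two chunks of input doc 1 both match "president"; one chunk of doc 2 matches.
hits = [
    {"chunk_id": 1, "orig_doc": 1, "score": 0.9},
    {"chunk_id": 2, "orig_doc": 1, "score": 0.7},
    {"chunk_id": 5, "orig_doc": 2, "score": 0.4},
]

results = dedupe_hits(hits)
print([h["chunk_id"] for h in results])
```

Merging the chunks for display, instead of dropping the lower-ranked ones,
would need the same grouping key but would collect the hits per original
document rather than discarding them.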