Message-ID: <4988AEA1.2060805@yahoo.co.uk>
Date: Tue, 03 Feb 2009 20:52:49 +0000
From: markharw00d
To: java-user@lucene.apache.org
Subject: Re: Poor QPS with highlighting

> Can you describe this in a little more detail; I'm not exactly sure what you
> mean.

Break your large text documents into multiple Lucene documents. Rather than dividing them into entirely discrete chunks of text, consider storing/indexing *overlapping* sections of text, with an overlap as big as the largest "slop" factor you use on Phrase/Span queries, so that you don't cut any potential phrase in half and fail to match it.

For example, this non-overlapping indexing scheme will not match a search for "George Bush":

  Doc 1 = ".... outgoing president George "
  Doc 2 = "Bush stated that ..."

While this overlapping scheme will match:

  Doc 1 = ".... outgoing president George "
  Doc 2 = "president George Bush stated that ..."

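A rough sketch of the overlapping split, in plain Java with no Lucene calls. The class name OverlappingChunker, the whitespace tokenisation, and the chunkSize/overlap parameters (both counted in tokens) are assumptions made for illustration only; in a real index you would split according to whatever your analyzer treats as a term. Note that a sliding window like this only guarantees that spans of up to overlap + 1 tokens end up whole inside some chunk, so you may want the overlap to cover your longest phrase plus slop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class OverlappingChunker {

    /**
     * Splits text into chunks of at most chunkSize whitespace-separated
     * tokens. Each chunk after the first starts with the last `overlap`
     * tokens of the previous chunk.
     */
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+");
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + chunkSize, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "the outgoing president George Bush stated that";
        // With an overlap of 2 tokens, "George Bush" stays whole in one chunk.
        for (String c : chunk(text, 4, 2)) {
            System.out.println(c);
        }
    }
}
```

Each returned string would then be indexed as its own Lucene document, alongside a field identifying the input document it came from.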
This fragmenting approach helps avoid the performance cost of highlighting very large documents. The remaining issue is removing duplicates from your search results when a query matches multiple chunks, e.g. Lucene Docs #1 and #2 both refer to Input Doc #1 and will both match a search for "president". You will need to store a field holding the "original document number" and remove any duplicates (or merge them in the display, if that is what is required).

Cheers,
Mark
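
The duplicate-removal step described above could look something like the following. This is only a sketch in plain Java: the Hit class, its origDocId field (standing in for the stored "original document number") and the dedupe method are all hypothetical names, and it assumes the hits arrive already sorted best-first, as search results normally are. It keeps the top-ranked chunk per input document; merging chunks for display would need a different policy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class ChunkHitDeduplicator {

    /** One search hit on a chunk document. */
    static final class Hit {
        final String origDocId; // value of the stored "original document number" field
        final float score;
        final String chunkText;

        Hit(String origDocId, float score, String chunkText) {
            this.origDocId = origDocId;
            this.score = score;
            this.chunkText = chunkText;
        }
    }

    /**
     * Keeps only the first hit seen for each original document.
     * Because the input is assumed to be ranked best-first, that is the
     * highest-ranked chunk, and the overall ranking order is preserved.
     */
    static List<Hit> dedupe(List<Hit> rankedHits) {
        Map<String, Hit> firstPerDoc = new LinkedHashMap<>();
        for (Hit h : rankedHits) {
            firstPerDoc.putIfAbsent(h.origDocId, h);
        }
        return new ArrayList<>(firstPerDoc.values());
    }
}
```

Because some hits are dropped, you would typically fetch more chunk hits than the number of results you want to show.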