From: Doug Cutting
Date: Wed, 25 Feb 2004 14:13:42 -0800
To: Lucene Developers List
Subject: Re: Dmitry's Term Vector stuff, plus some

markharw00d@yahoo.co.uk wrote:
> Bruce,
> Could a short-term (and possibly compromised) solution to your
> performance problem be to offer only the first 3k of these large 200k
> docs to the highlighter in order to minimize the amount of
> tokenization required? Arguably the most relevant bit of a document is
> typically in the first 1k anyway?

Or perhaps the highlighter could be changed to stop tokenizing a document after 1000 tokens when enough fragments have been found to produce a summary. That way, if there are hits in the first part of the document, which there probably usually are for high-scoring hits, then the time to compute the summary is bounded by something less than the document size.

Doug
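
The early-stop idea in the reply above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Lucene highlighter's code: the class BoundedSummarizer, its parameters (tokenFloor, wantedFragments, fragmentSize), the whitespace tokenization, and the fixed-size fragments are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical sketch: none of these names come from Lucene.
class BoundedSummarizer {
    final int tokenFloor;      // always scan at least this many tokens
    final int wantedFragments; // fragments needed to build a summary
    final int fragmentSize;    // tokens per fragment (assumed fixed)
    int tokensScanned;         // tokens consumed by the last summarize() call

    BoundedSummarizer(int tokenFloor, int wantedFragments, int fragmentSize) {
        this.tokenFloor = tokenFloor;
        this.wantedFragments = wantedFragments;
        this.fragmentSize = fragmentSize;
    }

    List<String> summarize(String text, Set<String> queryTerms) {
        List<String> fragments = new ArrayList<>();
        // StringTokenizer splits lazily, so unread text is never tokenized.
        StringTokenizer tokens = new StringTokenizer(text);
        StringBuilder current = new StringBuilder();
        int inFragment = 0;
        boolean hit = false;
        tokensScanned = 0;

        while (tokens.hasMoreTokens()) {
            // Early exit: past the token floor and enough fragments collected.
            if (tokensScanned >= tokenFloor && fragments.size() >= wantedFragments) {
                break;
            }
            String token = tokens.nextToken();
            tokensScanned++;
            if (queryTerms.contains(token.toLowerCase(Locale.ROOT))) {
                hit = true;
            }
            if (inFragment > 0) {
                current.append(' ');
            }
            current.append(token);
            inFragment++;

            if (inFragment == fragmentSize) {
                // Keep a completed fragment only if it contains a query term.
                if (hit && fragments.size() < wantedFragments) {
                    fragments.add(current.toString());
                }
                current.setLength(0);
                inFragment = 0;
                hit = false;
            }
        }
        // A trailing partial fragment still counts if it contains a hit.
        if (hit && inFragment > 0 && fragments.size() < wantedFragments) {
            fragments.add(current.toString());
        }
        return fragments;
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("lucene rocks");
        for (int i = 0; i < 5000; i++) {
            doc.append(" filler");
        }
        BoundedSummarizer s = new BoundedSummarizer(1000, 1, 20);
        List<String> summary = s.summarize(doc.toString(), Collections.singleton("lucene"));
        System.out.println(summary.size() + " fragment(s), " + s.tokensScanned + " tokens scanned");
    }
}
```

When the hits fall early in the document, the loop exits at the token floor, so the cost depends on the floor rather than on document length. A document with no hits is still scanned to the end, which is the unbounded case the truncation proposal in the quoted message would avoid.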