From: markharw00d@yahoo.co.uk
To: lucene-dev@jakarta.apache.org
Date: Thu, 26 Feb 2004 22:58:49 GMT
Subject: Re: Dmitry's Term Vector stuff, plus some

>>Another approach that someone mentioned for solving this problem is to create a fragment index for long documents.

Alternatively, could you use term sequence positions to guess where to start extracting text from the doc? If you have identified the best section of the doc based purely on identifying clusters of term positions, you can then identify a minimum offset into the doc by summing the text lengths of all the preceding terms.
This offset could be used to avoid tokenizing all the preamble and you would simply need to tokenize from the chosen offset until you had identified the run of terms that matched your best cluster sequence. I'm not sure if the TermVector support provides the necessary APIs to take this approach?
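
To make the idea concrete, here is a rough sketch in plain Python rather than against the Lucene API. The function names `best_cluster` and `min_char_offset` are hypothetical, and it assumes two things not confirmed above: that the positions of the matching terms are available, and that the text length of the term at every position in the doc can be looked up. Because whitespace, punctuation and any dropped stop words are not counted, the summed length is only a lower bound on the real character offset, which is all that is needed to skip the preamble safely.

```python
from typing import List, Tuple


def best_cluster(hit_positions: List[int], window: int) -> Tuple[int, int]:
    """Find the span of `window` term positions holding the most hits.

    Returns (first hit position in the best span, number of hits in it).
    """
    if not hit_positions:
        raise ValueError("no matching term positions")
    positions = sorted(hit_positions)
    best_start, best_count = positions[0], 0
    left = 0
    for right, pos in enumerate(positions):
        # Shrink from the left until all hits fit inside one window.
        while pos - positions[left] >= window:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_start, best_count = positions[left], count
    return best_start, best_count


def min_char_offset(term_lengths: List[int], start_position: int) -> int:
    """Lower bound on the character offset of the term at `start_position`.

    Sums the text lengths of all preceding terms; separators are ignored,
    so the true offset is at least this large.
    """
    return sum(term_lengths[:start_position])


# Toy document: "the quick brown fox jumps over the lazy dog"
term_lengths = [3, 5, 5, 3, 5, 4, 3, 4, 3]
hits = [0, 7, 8]  # positions of the matching terms (assumed known)

start, count = best_cluster(hits, window=3)
offset = min_char_offset(term_lengths, start)
```

With this toy input the best cluster starts at position 7 and the minimum offset is 28, while the word at that position actually begins at character 35. Tokenizing from offset 28 onwards would therefore still reach the cluster, at the cost of re-reading a few characters before it.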