Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Date: Thu, 26 Feb 2004 22:58:49 GMT
Message-Id: <200402262258.i1QMwnsT021484@server0027.freedom2surf.net>
From: markharw00d@yahoo.co.uk
To: lucene-dev@jakarta.apache.org
Subject: Re: Dmitry's Term Vector stuff, plus some

>>Another approach that someone mentioned for solving this problem is to create a fragment index for long documents.

Alternatively, could you use term positions to estimate where to start extracting text from the doc?
If you have identified the best section of the doc purely by finding clusters of term positions, you can then compute a minimum offset into the doc
by summing the lengths of all the preceding terms. This offset would let you skip tokenizing the preamble: you would only need to tokenize
from the chosen offset until you had found the run of terms that matched your best cluster.
I'm not sure whether the TermVector support provides the necessary APIs to take this approach.
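
To make the idea concrete, here is a rough sketch in plain Java. It deliberately uses no Lucene classes, since I don't know what the TermVector API exposes. The class name, both method names, the position-ordered term array and the fixed window size are all assumptions of mine for illustration. The offset is only a lower bound because whitespace, punctuation and any dropped stop words are not counted, and it assumes the indexed term text is no longer than the original text it came from.

```java
import java.util.Arrays;

/** Hypothetical sketch: pick the densest cluster of hit positions, then bound its character offset. */
public class ClusterOffsetSketch {

    /**
     * Finds the span of at most windowSize consecutive term positions that
     * contains the most query hits. Input must be sorted ascending.
     * Returns {firstHitPosition, lastHitPosition} of that cluster.
     */
    static int[] bestCluster(int[] sortedHitPositions, int windowSize) {
        if (sortedHitPositions.length == 0) {
            throw new IllegalArgumentException("no hit positions");
        }
        int lo = 0, bestLo = 0, bestHi = 0, bestCount = 0;
        for (int hi = 0; hi < sortedHitPositions.length; hi++) {
            // Shrink the window from the left until it spans fewer than windowSize positions.
            while (sortedHitPositions[hi] - sortedHitPositions[lo] >= windowSize) {
                lo++;
            }
            int count = hi - lo + 1;
            if (count > bestCount) {
                bestCount = count;
                bestLo = lo;
                bestHi = hi;
            }
        }
        return new int[] { sortedHitPositions[bestLo], sortedHitPositions[bestHi] };
    }

    /**
     * Lower bound on the character offset of the term at firstPosition:
     * the summed lengths of every term at an earlier position.
     * termsByPosition[i] is assumed to hold the term text at position i.
     */
    static int minOffset(String[] termsByPosition, int firstPosition) {
        int offset = 0;
        for (int p = 0; p < firstPosition; p++) {
            offset += termsByPosition[p].length();
        }
        return offset;
    }

    public static void main(String[] args) {
        String[] terms = "the quick brown fox jumps over the lazy dog near the quick red fox".split(" ");
        int[] hits = { 1, 8, 11, 13 };          // positions of "quick", "dog", "quick", "fox"
        int[] cluster = bestCluster(hits, 6);   // densest run within 6 positions
        System.out.println(Arrays.toString(cluster));
        System.out.println(minOffset(terms, cluster[0]));
    }
}
```

With an offset like this you would start the tokenizer there (or a little earlier, to be safe) and stop as soon as the run of terms in the chosen cluster has been seen.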


