lucene-dev mailing list archives

From markharw...@yahoo.co.uk
Subject Re: Dmitry's Term Vector stuff, plus some
Date Thu, 26 Feb 2004 22:58:49 GMT
>>Another approach that someone mentioned for solving this problem is to create a fragment
index for long documents.

Alternatively, could you use term sequence positions to guess where to start extracting text
from the doc?
If you have identified the best section of the doc purely by finding clusters of term
positions, you can then compute a minimum offset into the doc by summing the text lengths
of all the preceding terms. This offset is only a lower bound, since it ignores whitespace,
punctuation and anything else that was not indexed, but it would let you avoid tokenizing
all the preamble: you would simply need to tokenize from the chosen offset until you had
identified the run of terms that matched your best cluster sequence.
I'm not sure whether the TermVector support provides the necessary APIs to take this approach.
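
To make the idea concrete, here is a rough sketch in Python rather than against the Lucene API. Everything in it is an assumption for illustration: the names best_cluster and minimum_offset are hypothetical, a plain dict of term -> positions stands in for whatever a term vector exposes, and it assumes each indexed term is no longer than the text it was derived from (otherwise the sum is not a safe lower bound).

```python
from typing import Dict, List, Tuple


def best_cluster(hit_positions: List[int], window: int) -> Tuple[int, int]:
    """Find the span of `window` consecutive term positions holding the most hits.

    `hit_positions` are the term positions of the query terms in one document.
    Returns (start_position, hit_count) for the densest span.
    """
    positions = sorted(hit_positions)
    best_start, best_hits = positions[0], 0
    left = 0
    for right, pos in enumerate(positions):
        # Shrink from the left until the span fits inside the window.
        while pos - positions[left] >= window:
            left += 1
        hits = right - left + 1
        if hits > best_hits:
            best_start, best_hits = positions[left], hits
    return best_start, best_hits


def minimum_offset(term_positions: Dict[str, List[int]], start_position: int) -> int:
    """Lower bound on the character offset of `start_position`.

    Sums the text length of every term occurrence that precedes the chosen
    position. Whitespace, punctuation and unindexed words are not counted,
    so the true offset is at least this large.
    """
    total = 0
    for term, positions in term_positions.items():
        preceding = sum(1 for p in positions if p < start_position)
        total += len(term) * preceding
    return total


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog near the quick river"
    # Stand-in for a term vector: whitespace tokens mapped to their positions.
    vector: Dict[str, List[int]] = {}
    for position, token in enumerate(text.split()):
        vector.setdefault(token, []).append(position)

    hits = vector["quick"] + vector["river"]
    start, count = best_cluster(hits, window=3)
    offset = minimum_offset(vector, start)
    print(start, count, offset, text[offset:])
```

Because the offset is a lower bound, it may land in the middle of a word just before the cluster, so the first token produced from that point may be a fragment and should be discarded before matching the run of terms.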




