On Jun 3, 2009, at 10:25 PM, Marvin Humphrey wrote:

> Right now in the KS implementation, sentence boundary information is
> calculated on the fly at runtime, via Highlighter_Find_Sentences().
> However, this seems wasteful, because sentence boundaries can be known
> at index-time. Perhaps we ought to be storing sentence boundary
> information in the index.

Would you extend the Analysis interface to allow for custom sentence algorithms?

Could the sentences be numbered, so the final fragment has information about *which* sentence it came from? (I could use this for pagination.)

> Perhaps if each Span were to include a reference to the original Query
> object which produced it? These would be primitives such as TermQuery
> and PhraseQuery rather than compound queries like ANDQuery. Would that
> reference be enough to implement a preference for term diversity in
> the excerpting algo?

There is one scenario I can think of where that *might* not work. If someone searches for a list of keywords that includes the same keyword twice (e.g., I sometimes copy and paste a sentence to find documents with similar content), then there will be two TermQueries that are identical but considered different. Maybe this won’t matter because the duplicate term should have extra weight. I haven’t thought this through.

> And might that information come in handy for other excerpting algos?

As long as the supplied Term/PhraseQuery is the original object, and not a clone, I think it would.
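
To make the original-object-versus-clone point (and the duplicate keyword case) a bit more concrete, here is a rough Python sketch. It is not KS code: the TermQuery and Span classes and the two counting functions are hypothetical stand-ins I made up for illustration, and I am assuming that "term diversity" means something like "how many distinct queries contributed spans to a fragment".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    """Hypothetical stand-in for a primitive query; compares by value."""
    field: str
    term: str


@dataclass(frozen=True)
class Span:
    """Hypothetical highlight span carrying a reference to its source query."""
    offset: int
    length: int
    query: TermQuery


def diversity_by_identity(spans):
    # Counts distinct query *objects*: two equal TermQuery instances count twice.
    return len({id(span.query) for span in spans})


def diversity_by_value(spans):
    # Counts distinct query *values*: duplicate keywords collapse to one.
    return len({span.query for span in spans})


# A query string containing "index" twice yields two equal but separate objects.
q1 = TermQuery("content", "index")
q2 = TermQuery("content", "index")
q3 = TermQuery("content", "sentence")

spans = [Span(0, 5, q1), Span(40, 5, q2), Span(90, 8, q3)]

print(diversity_by_identity(spans))  # 3
print(diversity_by_value(spans))     # 2
```

Under these assumptions, counting by identity treats the repeated keyword as two separate queries, which may be fine if the duplicate is supposed to carry extra weight. It also shows why a clone would be a problem: a cloned query is equal to the original but is a different object, so anything keyed on identity would no longer match it.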