incubator-lucy-dev mailing list archives

From Father Chrysostomos <>
Subject Re: Excerpting algos
Date Fri, 05 Jun 2009 21:42:46 GMT

On Jun 3, 2009, at 10:25 PM, Marvin Humphrey wrote:

> Right now in the KS implementation, sentence boundary information is
> calculated on the fly at runtime, via Highlighter_Find_Sentences().
> However, this seems wasteful, because sentence boundaries can be known
> at index-time.  Perhaps we ought to be storing sentence boundary
> information in the index.

Would you extend the Analysis interface to allow for custom sentence  
algorithms? Could the sentences be numbered, so the final fragment has  
information about *which* sentence it came from? (I could use this for  
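
To make the two questions above concrete (a pluggable sentence algorithm, and numbered sentences), here is a minimal Python sketch. It is not KinoSearch or Lucy code: find_sentences, index_sentences, and sentence_number are hypothetical names, and the regex splitter is a deliberately naive placeholder for whatever custom algorithm an analyzer might supply.

```python
import re
from bisect import bisect_right

def find_sentences(text):
    """Naive default splitter (an assumption, not the KS algorithm).

    Returns a list of (start, end) character offsets, one per sentence.
    """
    pattern = re.compile(r'\S.*?(?:[.!?](?=\s|$)|$)', flags=re.S)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

def index_sentences(text, splitter=find_sentences):
    """Compute boundaries once, at index time.

    `splitter` is the hook for a custom sentence algorithm; the returned
    list would be stored alongside the document.
    """
    return splitter(text)

def sentence_number(boundaries, offset):
    """Map a character offset to the number of the sentence containing it.

    Returns None if the offset falls between sentences.
    """
    starts = [start for start, _ in boundaries]
    i = bisect_right(starts, offset) - 1
    if i >= 0 and offset < boundaries[i][1]:
        return i
    return None
```

With boundaries stored this way, the highlighter could look up the sentence number for a fragment's starting offset instead of re-scanning the text at search time.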

> Perhaps if each Span were to include a reference to the original Query
> object which produced it?  These would be primitives such as TermQuery
> and PhraseQuery rather than compound queries like ANDQuery.  Would that
> reference be enough to implement a preference for term diversity in the
> excerpting algo?

There is one scenario I can think of where that *might* not work. If  
someone searches for a list of keywords that includes the same keyword  
twice (e.g., I sometimes copy and paste a sentence to find documents  
with similar content), then there will be two TermQueries that are  
identical but considered different. Maybe this won’t matter because  
the duplicate term should have extra weight. I haven’t thought this  

> And might that information come in handy for other excerpting algos?

As long as the supplied Term/PhraseQuery is the original object, and  
not a clone, I think it would.
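
A small Python sketch of why object identity matters here, and of the duplicate-keyword case mentioned earlier. The TermQuery and Span classes below are hypothetical stand-ins, not the KinoSearch or Lucy classes, and the two counting functions are only one assumed way an excerpting algo might measure term diversity.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TermQuery:
    """Hypothetical stand-in for a primitive query (compares by value)."""
    field: str
    term: str

@dataclass
class Span:
    """Hypothetical highlight span that remembers which query produced it."""
    offset: int
    length: int
    source: TermQuery

def diversity_by_identity(spans):
    # Counts distinct query *objects*: two identical TermQueries count twice.
    return len({id(span.source) for span in spans})

def diversity_by_value(spans):
    # Counts distinct query *values*: duplicates collapse into one.
    return len({span.source for span in spans})

def spans_from(spans, query):
    # Identity test: only finds spans that hold the original object.
    return [span for span in spans if span.source is query]
```

For a search containing the same keyword twice, the identity count is one higher than the value count, which is the "identical but considered different" case. A cloned query, such as replace(q), compares equal to q but matches nothing in spans_from, so the reference is only useful if the original object is passed through.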
