incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Excerpting algos
Date Thu, 04 Jun 2009 05:25:34 GMT

I'm currently contemplating how to implement a Highlighter along the lines of
what's described in JIRA issue LUCENE-1522.

The tricky part of highlighting is selecting one or more excerpts.  Unless we
end up with a very strange excerpting algorithm that loses positional
information, applying the actual highlighting tags at the end will be a
straightforward, mechanical process.

To select the best fragments, we will create objects representing candidates
and assemble them in a PriorityQueue.  The queue's Less_Than() method will
favor "better" fragments.  Whether Less_Than() performs real analysis on the
fly, or whether we score the fragments in advance and have Less_Than simply
compare fixed scores has yet to be determined, but the important point is that
we must design our OO hierarchy and fragment ranking infrastructure to support
as many excerpting algorithms as possible.

There will be two main sources of input to the fragment ranker.

  * Information from the Query about where it matched.
  * Sentence boundary information.

Right now in the KS implementation, sentence boundary information is
calculated on the fly at runtime, via Highlighter_Find_Sentences().  However,
this seems wasteful, because sentence boundaries can be known at index-time.
Perhaps we ought to be storing sentence boundary information in the index.

Information from the query about where it matched can be gleaned from the
weighted-query class.   In Lucene, this class goes by the truly dreadful name
of "Weight"; in KinoSearch, it goes by the still-dreadful-but-different name
"Compiler".  We need to use Weight/Compiler rather than Query because some
algorithms (e.g. current KS) depend on IDF, which is only known after
weighting the Query against a given collection of documents to produce a

The current KS implementation, Compiler's Highlight_Spans() method, returns a
VArray of Span objects, each of which has an offset (measured in Unicode code
points from the top of the field), a length (also a count of Unicode code
points), and a floating point "weight".  Compound queries such as ANDQuery and
ORQuery simply produce a VArray which unions the Spans produced by their
child nodes.

I have an intuitive feeling that an array of weighted score spans will be
useful in other contexts besides highlighting, and the method is pretty easy
to grok.  However, this simple algo has a flaw: it's not clear what part of
the query produced each score span.  Some algorithms will want to influence
selection of fragments based on term diversity, so that for example, multiple
fragments would be preferred over a single fragment if they represented
different parts of the query.

Perhaps if each Span were to include a reference to the original Query object
which produced it?  These would be primitives such as TermQuery and
PhraseQuery rather than compound queries like ANDQuery.  Would that reference
be enough to implement a preference for term diversity in the excerpting algo?
And might that information come in handy for other excerpting algos?

Marvin Humphrey

View raw message