From Marvin Humphrey <mar...@rectangular.com>
Subject [lucy-dev] Highlighter excerpt boundaries
Date Thu, 19 Jan 2012 02:28:00 GMT
(Moving this thread from the issue tracker to the dev list because it's now
about an approach rather than a specific patch...)

On Wed, Jan 18, 2012 at 10:06:41PM +0000, Nick Wellnhofer (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188734#comment-13188734 ]

> Thinking more about a better fix for this problem, it's important to note
> that choosing a good excerpt is an operation that can be done without
> knowledge of the actual tokenization algorithm used in the indexing process.

There are multiple phases involved (see the sketch after this list):

  1. Identify sections of text that contain relevant material -- i.e. that
     contributed to the search-time score of the document.
  2. Pick one contiguous chunk of text which seems to contain a lot of
     relevant material.
  3. Choose precise start and end points for the excerpt.
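
To make that concrete, here's a rough sketch of the three phases as a
pipeline in C.  The function names are invented for exposition -- this is
not Lucy's actual API -- though HeatMap, Span, and Compiler are real
classes, and DocVector is assumed here as the per-document source of
offset data:

    #include <stdint.h>

    typedef struct HeatMap   HeatMap;    /* spans of relevant material */
    typedef struct Compiler  Compiler;   /* weighted Query             */
    typedef struct DocVector DocVector;  /* per-doc offset data        */
    typedef struct { int32_t start, end; } Span;  /* code-point offsets */

    /* Phase 1: find the spans that contributed to the doc's score. */
    HeatMap *identify_relevant_spans(Compiler *compiler, DocVector *doc_vec);

    /* Phase 2: pick one contiguous window dense with relevant material. */
    Span pick_best_window(HeatMap *heat_map, uint32_t excerpt_length);

    /* Phase 3: settle on precise start and end points. */
    Span refine_boundaries(const char *field_text, Span window);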

Phase 1 actually *does* require knowledge of the tokenization algorithm.  We
delegate creation of the HeatMap to our Query classes (technically, our
"Compiler" weighted Query classes).  They only handle granularity down to the
level of a token, so we need to provide them with a mapping of token-number =>
[start-offset,end-offset] in order to generate a HeatMap containing Spans
measured in code-point offsets; these code-point offsets are later used when
inserting highlight tags.
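
For illustration, here's a minimal standalone sketch of that mapping.
The array contents and names are hypothetical -- this is not
HighlightWriter's actual storage format, just the token-number =>
offsets idea:

    #include <stdio.h>

    typedef struct { int start; int end; } Span;  /* code-point offsets */

    int main(void) {
        /* token-number => [start-offset, end-offset] in code points,
         * e.g. for the text "quick brown foxes". */
        Span token_offsets[] = { {0, 5}, {6, 11}, {12, 17} };

        /* A heat-map span measured in token numbers: tokens 1..2. */
        int first_tok = 1, last_tok = 2;

        /* Translate to code-point offsets for inserting highlight tags. */
        Span tag_span = { token_offsets[first_tok].start,
                          token_offsets[last_tok].end };
        printf("highlight code points [%d, %d)\n",
               tag_span.start, tag_span.end);
        return 0;
    }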

In our present implementation, however, offset information is captured at
index-time (via HighlightWriter), so our Highlighter objects don't technically
need to know about the tokenization algo (as encapsulated in the highlight
field's Analyzer).

Phase 2 does not require knowledge of the tokenization algo.

Phase 3 can be implemented several different ways.  It *could* reuse the
original tokenization algo on its own, but that would produce sub-standard
results because Lucy's tokenization algos are generally concerned with words
rather than sentences, and excerpts chosen on word boundaries alone don't look
very good.

The present implementation uses improvised sentence boundary detection,
then falls back to whitespace -- and then, after your recent patch, to
truncation.  IMO, it would be nice to clean up the sentence boundary
detection to use the algo described in UAX #29 instead of the current
naive hack.

The remaining question is what to do when sentence boundary detection fails.
We can continue to fall back to whitespace, which works for plain text but
doesn't work well for e.g. URLs.  I think it might make sense to fall back to
the field's tokenization algorithm; we might also consider falling back to a
fixed choice of StandardTokenizer.  Both techniques will work well most of the
time but not all of the time.
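
As a sketch of that fallback chain -- the helpers below are naive
stand-ins for the real detectors, not Lucy's implementation:

    #include <ctype.h>
    #include <stdio.h>

    /* Naive "sentence start": the first position after ". " at or
     * before `target`, scanning back at most `window` bytes. */
    static long find_sentence_start(const char *text, long target,
                                    long window) {
        for (long i = target; i > 0 && target - i <= window; i--) {
            if (text[i - 1] == ' ' && i >= 2 && text[i - 2] == '.') {
                return i;
            }
        }
        return -1;
    }

    /* Whitespace fallback: back up to the nearest word start. */
    static long find_word_start(const char *text, long target,
                                long window) {
        for (long i = target; i > 0 && target - i <= window; i--) {
            if (isspace((unsigned char)text[i - 1])) {
                return i;
            }
        }
        return -1;
    }

    long choose_excerpt_start(const char *text, long target, long window) {
        long pos = find_sentence_start(text, target, window);
        if (pos >= 0) { return pos; }
        pos = find_word_start(text, target, window);
        if (pos >= 0) { return pos; }
        return target;  /* last resort: hard truncation */
    }

    int main(void) {
        const char *doc = "First sentence. Second sentence with a hit.";
        printf("%s\n", doc + choose_excerpt_start(doc, 25, 20));
        return 0;
    }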

> Such an approach wouldn't depend on the analyzer at all and it wouldn't
> introduce additional coupling of Lucy's components. 

Not sure what I'm missing, but I don't understand the "coupling" concern.  It
seems to me that wrapping our sentence boundary detection mechanism in a
battle-tested design like Analyzer, rather than doing something ad hoc,
would be desirable code re-use.
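
Something like the following -- all names hypothetical -- is the shape I
have in mind: Highlighter calls through an Analyzer-style interface and
never embeds the boundary algorithm itself:

    typedef struct SentenceBreaker SentenceBreaker;
    struct SentenceBreaker {
        /* Return the offset of the sentence boundary nearest `from`,
         * or -1 if there is none. */
        long (*next_boundary)(SentenceBreaker *self, const char *text,
                              long from);
    };

    /* Highlighter-side code depends only on the interface... */
    long excerpt_start(SentenceBreaker *breaker, const char *text,
                       long target) {
        long pos = breaker->next_boundary(breaker, text, target);
        return pos >= 0 ? pos : target;
    }

    /* ...so a UAX #29 implementation can slot in later without
     * touching Highlighter.c. */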

I'm actually very excited about getting all that sentence boundary detection
stuff out of Highlighter.c, which will become much easier to grok and maintain
as a result.  Separation of concerns FTW!

> Of course, it would mean to implement a separate Unicode-capable word
> breaking algorithm for the highlighter. But this shouldn't be very hard as
> we could reuse parts of the StandardTokenizer.

IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
Excerpts look much better when trimmed at sentence boundaries, and
word-break algos don't get you those.

Marvin Humphrey

