lucy-dev mailing list archives

From Nick Wellnhofer
Subject Re: [lucy-dev] Highlighter excerpt boundaries
Date Thu, 19 Jan 2012 10:43:59 GMT
On 19/01/2012 03:28, Marvin Humphrey wrote:
> Phase 3 can be implemented several different ways.  It *could* reuse the
> original tokenization algo on its own, but that would produce sub-standard
> results because Lucy's tokenization algos are generally concerned with words
> rather than sentences, and excerpts chosen on word boundaries alone don't look
> very good.

You're right. I was only talking about Phase 3.

>> Such an approach wouldn't depend on the analyzer at all and it wouldn't
>> introduce additional coupling of Lucy's components.
> Not sure what I'm missing, but I don't understand the "coupling" concern.  It
> seems to me as though it would be desirable code re-use to wrap our sentence
> boundary detection mechanism within a battle-tested design like Analyzer,
> rather than do something ad-hoc.

The analyzers are designed to split a whole string into tokens. In the 
highlighter we only need to find a single boundary near a certain 
position in a string. So the analyzer interface isn't an ideal fit for 
the highlighter. The performance hit of running a tokenizer over the 
whole substring shouldn't be a problem but I'd still like to consider 

> I'm actually very excited about getting all that sentence boundary detection
> stuff out of Highlighter.c, which will become much easier to grok and maintain
> as a result.  Separation of concerns FTW!

We could also move the boundary detection to a string utility class.

>> Of course, it would mean to implement a separate Unicode-capable word
>> breaking algorithm for the highlighter. But this shouldn't be very hard as
>> we could reuse parts of the StandardTokenizer.
> IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
> It looks much better if you trim excerpts at sentence boundaries, and
> word-break algos don't get you those.

I would keep the sentence boundary detection, of course. I'm only 
talking about the word breaking part.
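
To illustrate what "find a single boundary near a certain position" could look like without tokenizing the whole string, here is a minimal sketch in Python. It is not Lucy code: the names `is_word_char`, `is_boundary`, `nearest_word_boundary` and the `window` parameter are hypothetical, and the word-character test is a crude stand-in for a real Unicode word-break algorithm (it does not implement the rules StandardTokenizer follows).

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Crude word-character test (assumption): letters, digits, marks, underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N", "M")


def is_boundary(text: str, i: int) -> bool:
    """True if offset i lies between a word character and a non-word character.

    The start and end of the string always count as boundaries.
    """
    if i <= 0 or i >= len(text):
        return True
    return is_word_char(text[i - 1]) != is_word_char(text[i])


def nearest_word_boundary(text: str, pos: int, window: int = 20) -> int:
    """Scan outward from pos and return the closest boundary offset.

    Only characters within `window` of pos are inspected, so the cost does
    not depend on the length of the string. If no boundary is found inside
    the window, pos itself is returned (clamped to the string).
    """
    pos = max(0, min(pos, len(text)))
    for distance in range(window + 1):
        # Prefer the candidate to the left when both sides are equally close.
        for candidate in (pos - distance, pos + distance):
            if 0 <= candidate <= len(text) and is_boundary(text, candidate):
                return candidate
    return pos
```

Called with an offset in the middle of a word, e.g. `nearest_word_boundary("The quick brown fox", 6)`, the function looks only at the few characters around offset 6 instead of producing tokens for the whole string. Sentence boundary detection would still be a separate step, as discussed above.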

