lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Highlighter excerpt boundaries
Date Fri, 20 Jan 2012 00:52:01 GMT

On Thu, Jan 19, 2012 at 11:43:59AM +0100, Nick Wellnhofer wrote:
>> Not sure what I'm missing, but I don't understand the "coupling" concern.  It
>> seems to me as though it would be desirable code re-use to wrap our sentence
>> boundary detection mechanism within a battle-tested design like Analyzer,
>> rather than do something ad-hoc.
>
> The analyzers are designed to split a whole string into tokens. In the
> highlighter we only need to find a single boundary near a certain  
> position in a string. So the analyzer interface isn't an ideal fit for  
> the highlighter. The performance hit of running a tokenizer over the  
> whole substring shouldn't be a problem but I'd still like to consider  
> alternatives.

It's rare that we need to optimize for performance.  Most of the time we
should be optimizing for maintainability.

I'm advocating using Analyzer because we have several of them, and because the
parallelism between StandardTokenizer and a StandardSentenceTokenizer based on
UAX #29 would lower the cost of maintaining them.

However, that's only one way to optimize for maintainability, and it's not
necessarily the best available stratagem.  It may be that low-level code
leveraging an Analyzer is verbose... or not... we'd just have to try.
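
To make the trade-off concrete, here is a rough sketch of the tokenize-everything
shape: split the whole string into sentence spans, then pick the span covering
the position of interest.  This is not Lucy's Analyzer API and not UAX #29.
The Span struct and both function names are hypothetical, and the rule is a
naive ASCII one (a '.', '!' or '?' followed by whitespace or end of text).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type for illustration only; not part of Lucy. */
typedef struct {
    size_t start;   /* byte offset of first char in the sentence */
    size_t end;     /* byte offset one past the last char */
} Span;

/* Split `text` into sentence spans using a naive ASCII rule (NOT UAX #29).
 * Writes at most `max_spans` spans and returns the number written. */
size_t
naive_split_sentences(const char *text, Span *spans, size_t max_spans) {
    size_t len   = strlen(text);
    size_t count = 0;
    size_t i     = 0;

    while (i < len && count < max_spans) {
        /* Skip whitespace between sentences. */
        while (i < len && isspace((unsigned char)text[i])) { i++; }
        if (i >= len) { break; }

        size_t start = i;
        /* Advance past a terminator that is followed by whitespace or
         * by the end of the text. */
        while (i < len) {
            char c = text[i++];
            if ((c == '.' || c == '!' || c == '?')
                && (i == len || isspace((unsigned char)text[i]))) {
                break;
            }
        }
        spans[count].start = start;
        spans[count].end   = i;
        count++;
    }
    return count;
}

/* Return the span containing byte offset `pos`, or NULL if none does. */
const Span*
naive_span_containing(const Span *spans, size_t count, size_t pos) {
    for (size_t k = 0; k < count; k++) {
        if (pos >= spans[k].start && pos < spans[k].end) {
            return &spans[k];
        }
    }
    return NULL;
}
```

For "Hi there. How are you? Fine." and byte offset 15, the lookup returns the
span from 10 to 22.  The whole string gets scanned even though only one span
is wanted, which is the cost Nick describes above.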

>> I'm actually very excited about getting all that sentence boundary detection
>> stuff out of Highlighter.c, which will become much easier to grok and maintain
>> as a result.  Separation of concerns FTW!
>
> We could also move the boundary detection to a string utility class.

I suspect that at some point we will want to expose sentence boundary
detection via a public API, because people who subclass Highlighter may want
to use it.  Father Chrysostomos did when he wrote KSx::Highlight::Summarizer.
(The old KinoSearch Highlighter exposed a find_sentences() method at one
point.  It was a victim of the C rewrite; Highlighter was one of the harder
modules to port.)

It seems to me that publishing UAX #29 sentence boundary detection via an
Analyzer is a conservative API extension, since it's so closely related to the
UAX #29 word boundary detection we expose via StandardTokenizer.

So that explains what I was thinking.  But of course refactoring sentence
boundary detection into a string utility function also achieves the end of
cleaning up Highlighter.c just as effectively, and might be more elegant --
who knows?

Until we actually expose this capability via a public API, either approach
should work fine.
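
For comparison, a minimal sketch of the string utility shape: scan backwards
from a position until a sentence start turns up, without touching the rest of
the string.  The function name is hypothetical and the rule is again a naive
ASCII stand-in for UAX #29 (a '.', '!' or '?' followed by whitespace), so
treat it as an illustration of the interface rather than a proposal for the
implementation.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Lucy.  Returns the byte offset of the
 * nearest sentence start at or before `pos`, or 0 if none is found.
 * Naive ASCII rule only; this does NOT implement UAX #29. */
size_t
naive_sentence_start(const char *text, size_t pos) {
    size_t len = strlen(text);
    if (pos > len) { pos = len; }

    for (size_t i = pos; i > 0; i--) {
        /* A candidate start is a non-whitespace char (or end of text)
         * preceded by whitespace. */
        if (!isspace((unsigned char)text[i - 1])) { continue; }
        if (i < len && isspace((unsigned char)text[i])) { continue; }

        /* Walk back over the whitespace run to the char before it. */
        size_t j = i - 1;
        while (j > 0 && isspace((unsigned char)text[j])) { j--; }

        if (text[j] == '.' || text[j] == '!' || text[j] == '?') {
            return i;
        }
    }
    return 0;
}
```

The caller passes a string and an offset and gets an offset back, so there is
no token or span bookkeeping.  The price is a second code path that would
have to track the sentence rules separately from whatever the tokenizer uses.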

>>> Of course, it would mean implementing a separate Unicode-capable word
>>> breaking algorithm for the highlighter. But this shouldn't be very hard as
>>> we could reuse parts of the StandardTokenizer.
>>
>> IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
>> It looks much better if you trim excerpts at sentence boundaries, and
>> word-break algos don't get you those.
>
> I would keep the sentence boundary detection, of course. I'm only  
> talking about the word breaking part.

Groovy, sounds like we're on the same page about that then. :)

Marvin Humphrey

