incubator-lucy-dev mailing list archives

From: Marvin Humphrey <>
Subject: Re: Excerpting algos
Date: Sat, 06 Jun 2009 16:22:53 GMT
On Fri, Jun 05, 2009 at 02:42:46PM -0700, Father Chrysostomos wrote:

First, a bit of good news:  I've managed to fix the current KS Highlighter
sentence-boundary trimming implementation without needing to start over from
scratch, and without causing any problems for the KSx::Highlight::Summarizer
test suite.  That means we don't have to conclude this discussion and finish
the implementation to unblock a KS dev release.  (: For better or worse. :)

> >Right now in the KS implementation, sentence boundary information is
> >calculated on the fly at runtime, via Highlighter_Find_Sentences().
> >However, this seems wasteful, because sentence boundaries can be known at
> >index-time.  Perhaps we ought to be storing sentence boundary information
> >in the  index.
> Would you extend the Analysis interface to allow for custom sentence  
> algorithms? 

Since this is a tokenization task, Analyzer would be a logical place to turn.
I think we'll need to make two passes over the text, one for search tokens and
one for sentences.

Do we actually need to extend Analyzer, though?  I think we ought to avoid
giving Analyzer a Find_Sentences() method.  Instead, we can just create an
Analyzer instance which tokenizes at sentence boundaries.  Probably we'll want
to create a dedicated SentenceTokenizer subclass, which would not be publicly
exposed.

Instead, we can turn TermVectorsWriter into a public HighlightWriter class
and give it a Set_Sentence_Tokenizer() method.  Extensibility would happen via
a custom Architecture:

  package MyArchitecture;
  use base qw( KinoSearch::Architecture );

  sub register_highlight_writer {
    my ( $self, $seg_writer ) = @_;
    my $hl_writer = $seg_writer->obtain("KinoSearch::Index::HighlightWriter");
    # Swap in our own sentence boundary detection.
    $hl_writer->set_sentence_tokenizer( MySentenceTokenizer->new );
  }

I think this approach will work provided that it's possible to use the same
sentence boundary detection algo across most or all of the languages supported
by Snowball.  (Does the basic algo of splitting on /\.\s+/ work for Greek?)
CJK users and others for whom our algo would fail would need to spec a custom
Architecture -- though only if they want highlighting, since it's off by
default.  It's a bit more work for that class of user, but it prevents us from
having to add clutter to the crucial core classes of Analyzer and Schema.  
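
Just to sketch what I mean, MySentenceTokenizer could start out as little
more than a regex.  This is hypothetical code -- it leans on the stock
Tokenizer and assumes its "pattern" constructor arg -- but the offsets it
records are the point:

  use KinoSearch::Analysis::Tokenizer;

  # Naive sentence "tokenizer": each match is one sentence, so each token's
  # start_offset/end_offset pair marks a sentence boundary.  Same spirit as
  # splitting on /\.\s+/.
  my $sentence_tokenizer = KinoSearch::Analysis::Tokenizer->new(
      pattern => '\S[^.]*\.',
  );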

It will be somewhat wasteful if we use this SentenceTokenizer class to create
full-fledged tokens when all we need is offsets, but I think we would handle
further optimizations via natural extensions to either Analyzer or Inversion.
I say "natural", because we would be merely repurposing the same offset
information that Tokenizer normally feeds to Token's constructor, as opposed
to glomming on a Find_Sentences() method which would apply a completely
different tokenizing algorithm.

> Could the sentences be numbered, so the final fragment has information
> about *which* sentence it came from? (I could use this for pagination.)

I think that would work.  The current "DocVector" class needs to mutate into
"InvertedDoc" or something like that, and InvertedDoc needs to provide
sentence boundary information somehow.

We often need to use iterators for scaling purposes in KS/Lucy, but huge docs
are problematic for highlighting anyway, so I think we can just go with two
i32_t arrays: one each for sentence_offsets and sentence_lengths.  In the
index, we'd probably store this information as a string of delta-encoded C32s
representing offsets from the top of the field, measured in Unicode code
points.  Given the text:

  "Best. Joke. Ever."

the InvertedDoc accessors would return:

  $inverted_doc->get_sentence_offsets; # [ 0, 6, 12 ]
  $inverted_doc->get_sentence_lengths; # [ 5, 5, 5 ]

In the index:

  0, 5, 1, 5, 1, 5

That preserves your requested sentence numbering information through read
time, accessible as the array tick into the sentence_offsets and
sentence_lengths arrays.
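
The encode step would be trivial.  A throwaway sketch in plain Perl (not an
existing KS API, just illustrating the delta scheme):

  # Interleave ( gap-from-end-of-previous-sentence, sentence-length )
  # pairs; for the example above, this produces ( 0, 5, 1, 5, 1, 5 ).
  my @offsets = ( 0, 6, 12 );
  my @lengths = ( 5, 5, 5 );
  my @deltas;
  my $last_end = 0;
  for my $i ( 0 .. $#offsets ) {
    push @deltas, $offsets[$i] - $last_end, $lengths[$i];
    $last_end = $offsets[$i] + $lengths[$i];
  }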

> >Perhaps if each Span were to include a reference to the original  Query
> >object which produced it?  These would be primitives such as TermQuery and
> >PhraseQuery rather than compound queries like ANDQuery.  Would that
> >reference be enough to implement a preference for term diversity in the
> >excerpting algo?
> There is one scenario I can think of where that *might* not work. If  
> someone searches for a list of keywords that includes the same keyword  
> twice (e.g., I sometimes copy and paste a sentence to find documents  
> with similar content), then there will be two TermQueries that are  
> identical but considered different. 

All Query classes should be implementing the Equals() method so that logically
equivalent objects can be identified.  Does that address your concern?

We'll probably want to reference the Compiler/Weight rather than the original
Query; right now in KS I don't think I have Equals() implemented for any
Compiler classes, but that shouldn't be hard to finish.  [1]
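
Concretely, the duplicate-keyword case would boil down to something like
this -- a sketch which assumes Equals() gets finished for the classes
involved:

  use KinoSearch::Search::TermQuery;

  # Two logically equivalent queries built separately, e.g. because the
  # same keyword appeared twice in the search string...
  my $query_a = KinoSearch::Search::TermQuery->new(
      field => 'content',
      term  => 'foo',
  );
  my $query_b = KinoSearch::Search::TermQuery->new(
      field => 'content',
      term  => 'foo',
  );

  # ... should compare as duplicates rather than as distinct objects.
  warn "duplicate term" if $query_a->equals($query_b);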

> Maybe this won’t matter because  the duplicate term should have extra
> weight. I haven’t thought this through.

I think the only way we'll nail the extensibility aspect of this design is if
we build working implementations for multiple highlighting algorithms.

Probably your Summarizer and a class which implements the
term-diversity-preferring algo described by Michael Busch and Mike McCandless
in LUCENE-1522 would be enough.

> >And might that information come in handy for other excerpting algos?
> As long as the supplied Term/PhraseQuery is the original object, and  
> not a clone, I think it would.

I think you say that because of the equivalence question, right?

The KS Highlighter creates its own internal Compiler object using the supplied
"searchable" and "query" constructor args.  The DocVector/InvertedDoc has to
be able to go over the network, but the score spans won't -- so each score
span would always be pointing to some sub-component of that local Compiler.

I'm not entirely satisfied with this approach.  The Span class has been simple
up till now -- it *could* have been sent over the network with no problem.
Bloating it up with a reference to the Query/Compiler makes it both less
general and less transportable.

Marvin Humphrey

[1] Compiler is a subclass of Query in KS.  This is different from Lucene,
    where Weight does not subclass Query.
