incubator-lucy-dev mailing list archives

Subject Re: Excerpting algos
Date Sun, 07 Jun 2009 01:05:57 GMT

On Jun 6, 2009, at 9:22 AM, Marvin Humphrey wrote:

> On Fri, Jun 05, 2009 at 02:42:46PM -0700, Father Chrysostomos wrote:
> First, a bit of good news:  I've managed to fix the current KS Highlighter
> sentence-boundary trimming implementation without needing to start over from
> scratch, and without causing any problems for the KSx::Highlight::Summarizer
> test suite.  That means we don't have to conclude this discussion and finish
> the implementation to unblock a KS dev release.  (: For better or worse. :)

I don’t know whether you are aware: I cheated and copied & pasted the
find_sentence_boundaries from KS r3122 to KSx:H:S, since I was in a

>> Would you extend the Analysis interface to allow for custom sentence
>> algorithms?
>
> Since this is a tokenization task, Analyzer would be a logical place to turn.
> I think we'll need to make two passes over the text, one for search tokens
> and one for sentences.
>
> Do we actually need to extend Analyzer, though?  I think we ought to avoid
> giving Analyzer a Find_Sentences() method.  Instead, we can just create an
> Analyzer instance which tokenizes at sentence boundaries.  Probably we'll
> want to create a dedicated SentenceTokenizer subclass, which would not be
> publicly exposed.

I’ve just had an idea: Since we have 1) words, 2) sentences, and 3) pages,
why not multiple levels of vector information? Or multiple ‘sets’ (which
could be orthogonal/overlapping)? Someone may want to include paragraphs or
chapters, for instance. Just a thought....

> Instead, we can turn TermVectorsWriter into a public HighlightWriter class
> and give it a Set_Sentence_Tokenizer() method.  Extensibility would happen
> via Architecture:
>
>   package MyArchitecture;
>   use base qw( KinoSearch::Architecture );
>
>   sub register_highlight_writer {
>     my ( $self, $seg_writer ) = @_;
>     $self->SUPER::register_highlight_writer($seg_writer);
>     my $hl_writer
>       = $seg_writer->obtain("KinoSearch::Index::HighlightWriter");
>     $hl_writer->set_sentence_tokenizer( MySentenceTokenizer->new );
>   }

Or maybe $hl_writer->add_tokenizer( MySentenceTokenizer->new );
We may need to distinguish between ‘offset tokenisers’ and ‘term tokenisers’.

> I think this approach will work provided that it's possible to use the same
> sentence boundary detection algo across most or all of the languages
> supported by Snowball.  (Does the basic algo of splitting on /\.\s+/ work
> for Greek?)

Yes, except for the same problem that it causes in English: ‘M. Humphrey’
becomes two sentences. (As an aside, your default tokeniser doesn’t work
with Greek, which can have mid-commas, but the only two words with
mid-commas [ὅ,τι and ὅ,τιδηποτε] are stop-words, so I don’t worry about it.)
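To make the failure mode concrete, here is a minimal Python sketch (Python
purely for illustration; the real tokenizer would live in KS's core) of how
splitting on /\.\s+/ mis-handles abbreviations:

```python
import re

def naive_sentences(text):
    """Split on a period followed by whitespace -- the basic algo
    under discussion.  Returns the resulting sentence strings."""
    return re.split(r'\.\s+', text)

# An abbreviation like "M." triggers a spurious boundary:
sents = naive_sentences("M. Humphrey wrote the patch. It works.")
# -> ['M', 'Humphrey wrote the patch', 'It works.']
```

The name gets chopped into a one-letter "sentence", exactly the problem
described above.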

> CJK users and others for whom our algo would fail would need to spec a
> custom Architecture -- though only if they want highlighting, since it's off
> by default.  It's a bit more work for that class of user, but it prevents us
> from having to add clutter to the crucial core classes of Analyzer and
> Schema.
>
> It will be somewhat wasteful if we use this SentenceTokenizer class to
> create full fledged tokens when all we need is offsets, but I think we would
> handle further optimizations via natural extensions to either Analyzer or
> Inversion.  I say "natural", because we would be merely repurposing the same
> offset information that Tokenizer normally feeds to Token's constructor, as
> opposed to glomming on a Find_Sentences() method which would apply a
> completely different tokenizing algorithm.

Sounds good.
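As a rough sketch of the offsets-only idea (Python for illustration; the
function name and the split-on-/\.\s+/ rule are assumptions, not the KS API),
a sentence tokenizer could emit bare (offset, length) pairs instead of
full-fledged token objects:

```python
import re

# A boundary is a period followed by whitespace, per the basic algo above.
_BOUNDARY = re.compile(r'\.\s+')

def sentence_offsets(text):
    """Yield (offset, length) pairs for each sentence, measured in code
    points from the top of the field -- no token objects are built."""
    start = 0
    for m in _BOUNDARY.finditer(text):
        # Include the period in the sentence, exclude the trailing space.
        yield (start, m.start() + 1 - start)
        start = m.end()
    if start < len(text):
        yield (start, len(text) - start)

pairs = list(sentence_offsets("Best. Joke. Ever."))
# -> [(0, 5), (6, 5), (12, 5)]
```

This reproduces the offsets/lengths given for the "Best. Joke. Ever."
example below, without ever constructing a full token.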

>> Could the sentences be numbered, so the final fragment has information
>> about *which* sentence it came from? (I could use this for pagination.)
>
> I think that would work.  The current "DocVector" class needs to mutate into
> "InvertedDoc" or something like that, and InvertedDoc needs to provide
> sentence boundary information somehow.
>
> We often need to use iterators for scaling purposes in KS/Lucy, but huge
> docs are problematic for highlighting anyway, so I think we can just go with
> two i32_t arrays: one each for sentence_offsets and sentence_lengths.  In
> the index, we'd probably store this information as a string of delta-encoded
> C32s representing offset from the top of the field measured in Unicode code
> points.
>
> Source:
>
>   "Best. Joke. Ever."
>
> Search-time:
>
>   $inverted_doc->get_sentence_offsets; # [ 0, 6, 12 ]
>   $inverted_doc->get_sentence_lengths; # [ 5, 5, 5 ]
>
> In the index:
>
>   0, 5, 1, 5, 1, 5
>
> That preserves your requested sentence numbering information through read
> time, accessible as array tick in the sentence_offsets and sentence_lengths
> arrays.
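The interleaved delta scheme in that example can be sketched in Python
(illustration only; the real implementation would write variable-width C32s
to the index file rather than a Python list). Each offset is stored as its
gap from the end of the previous sentence:

```python
def encode_deltas(offsets, lengths):
    """Interleave (offset-delta, length) pairs; each offset is stored as
    its distance from the end of the previous sentence."""
    out = []
    prev_end = 0
    for off, length in zip(offsets, lengths):
        out.append(off - prev_end)  # gap since the previous sentence ended
        out.append(length)
        prev_end = off + length
    return out

def decode_deltas(deltas):
    """Recover the absolute offsets and lengths arrays at read time."""
    offsets, lengths = [], []
    prev_end = 0
    for i in range(0, len(deltas), 2):
        off = prev_end + deltas[i]
        offsets.append(off)
        lengths.append(deltas[i + 1])
        prev_end = off + deltas[i + 1]
    return offsets, lengths

# "Best. Joke. Ever." from the example above:
encoded = encode_deltas([0, 6, 12], [5, 5, 5])
# -> [0, 5, 1, 5, 1, 5], matching the on-disk layout shown above
decoded = decode_deltas(encoded)
# -> ([0, 6, 12], [5, 5, 5])
```

Sentence numbering survives for free: a sentence's number is simply its
index in the decoded arrays.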
>>> Perhaps if each Span were to include a reference to the original Query
>>> object which produced it?  These would be primitives such as TermQuery
>>> and PhraseQuery rather than compound queries like ANDQuery.  Would that
>>> reference be enough to implement a preference for term diversity in the
>>> excerpting algo?
>>
>> There is one scenario I can think of where that *might* not work. If
>> someone searches for a list of keywords that includes the same keyword
>> twice (e.g., I sometimes copy and paste a sentence to find documents
>> with similar content), then there will be two TermQueries that are
>> identical but considered different.
>
> All Query classes should be implementing the Equals() method so that
> logically equivalent objects can be identified.  Does that address your
> concern?
>
> We'll probably want to reference the Compiler/Weight rather than the
> original Query; right now in KS I don't think I have Equals() implemented
> for any Compiler classes, but that shouldn't be hard to finish.  [1]
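A minimal sketch of why Equals() addresses the duplicate-TermQuery concern
(Python; the class and method names only mimic the KS design and are not
its API):

```python
class TermQuery:
    """Toy stand-in for a primitive query: a field plus a term."""
    def __init__(self, field, term):
        self.field = field
        self.term = term

    def equals(self, other):
        """Logical equivalence: same field and same term, even when the
        objects are distinct instances or clones."""
        return (isinstance(other, TermQuery)
                and self.field == other.field
                and self.term == other.term)

# Pasting a sentence as a query can yield duplicate terms:
queries = [TermQuery("content", "joke"),
           TermQuery("content", "best"),
           TermQuery("content", "joke")]

# Deduplicate by logical equality rather than object identity:
unique = []
for q in queries:
    if not any(q.equals(u) for u in unique):
        unique.append(q)
# Two of the three queries are logically equivalent, so only two remain.
```

With identity comparison the two "joke" queries would be treated as
different; Equals() collapses them.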
>> Maybe this won’t matter because the duplicate term should have extra
>> weight. I haven’t thought this through.
>
> I think the only way we'll nail the extensibility aspect of this design is
> if we build working implementations for multiple highlighting algorithms.
> Probably your Summarizer and a class which implements the
> term-diversity-preferring algo described by Michael Busch and Mike
> McCandless from LUCENE-1522 would be enough.

I would like to make Summarizer value term diversity, so we’ll be left with
one. I could make it an option instead.

>>> And might that information come in handy for other excerpting algos?
>>
>> As long as the supplied Term/PhraseQuery is the original object, and
>> not a clone, I think it would.
>
> I think you say that because of the equivalence question, right?
>
> The KS Highlighter creates its own internal Compiler object using the
> supplied "searchable" and "query" constructor args.  The DocVector/
> InvertedDoc has to be able to go over the network, but the score spans
> won't -- so each score span would always be pointing to some sub-component
> of that local Compiler object.
>
> I'm not entirely satisfied with this approach.  The Span class has been
> simple up till now -- it *could* have been sent over the network with no
> problem.  Bloating it up with a reference to the Query/Compiler makes it
> both less general and less transportable.

How about $compiler->give_me_the_query_for($span)? (With a better method
name, of course.) Or would that make Compiler too complex, since it would
have to store a hash (or equivalent) in addition to its array of spans?
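The lookup suggested above could be as simple as a hash keyed on span
identity, which keeps Span itself lean and transportable (Python sketch;
all names here are hypothetical, not a proposed API):

```python
class Span:
    """Minimal span: offset, length, weight -- deliberately free of any
    Query/Compiler reference so it stays simple and transportable."""
    def __init__(self, offset, length, weight=0.0):
        self.offset = offset
        self.length = length
        self.weight = weight

class Compiler:
    """Toy compiler that records which sub-query produced each span,
    instead of bloating Span with a back-reference."""
    def __init__(self):
        self._query_for_span = {}  # id(span) -> producing query

    def record(self, span, query):
        self._query_for_span[id(span)] = query

    def query_for(self, span):
        # The better-named version of give_me_the_query_for().
        return self._query_for_span.get(id(span))

c = Compiler()
s = Span(0, 5)
c.record(s, "TermQuery(content:joke)")
# c.query_for(s) -> "TermQuery(content:joke)"
```

The cost is exactly the extra hash the question anticipates; the benefit is
that spans remain plain data.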

But I thought queries could be sent over the network.

Father Chrysostomos
