lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Excerpting algos
Date Tue, 09 Jun 2009 01:10:46 GMT
On Sat, Jun 06, 2009 at 06:05:57PM -0700, webmasters@ctosonline.org wrote:

> I don’t know whether you are aware: I cheated and copied & pasted the  
> find_sentence_boundaries from KS r3122 to KSx:H:S, since I was in a  
> hurry.

Yes, I've seen.  FWIW, find_sentence_boundaries(), which returned an array of
integer positions, has been replaced by Find_Sentences(), which returns an
array of Spans.  The new implementation is more powerful because Spans can
carry both sentence start and length info.

However, since Summarizer overrides Create_Excerpt() and doesn't rely on the
no-longer-extant default implementation of find_sentence_boundaries(), there's
no conflict.

After the work I did on Friday tightening up the sentence boundary trimming,
the new implementation has gone from unacceptably buggy to
better-than-the-old.  There's still room for improvement, but now Highlighter
manages to start and end excerpts on sentence boundaries more often, and
applies ellipses more cleanly.

> I’ve just had an idea: Since we have 1) words, 2) sentences and 3)  
> pages, why not multiple levels of vector information? Or multiple  
> ‘sets’ (which could be orthogonal/overlapping)? Someone may want to  
> include paragraphs or chapters, for instance. Just a thought....

This makes sense.  And it seems like we ought to be able to integrate
hierarchical tokenization info into our file format.

Nevertheless, I would argue that sentences are a uniquely important unit with
regard to excerpting and highlighting, because of the size of the typical
excerpt, and because we are visually sensitive to sentence construction in a
way that we wouldn't be for e.g. chapters.

> Or maybe $hl_writer->add_tokenizer( MySentenceTokenizer->new );
> We may need to distinguish between ‘offset tokenisers’ and ‘term  
> tokenisers’.

Well, TermVectorsWriter's Add_Inverted_Doc() method takes an Inverter, which
contains Inversion objects for each fulltext field; and these Inversion
objects have already been inverted via their specified Analyzers by the time
TermVectorsWriter sees them.  So there isn't really a need for the
TermVectorsWriter to be aware of the stock Analyzers.

But beyond that, as we add to the stack of tokenization levels, we probably
need to name them.
    
    $hl_writer->add_tokenization(
        name      => 'sentences',
        tokenizer => MySentenceTokenizer->new,
    );
    $hl_writer->add_tokenization(
        name      => 'paragraphs',
        tokenizer => MyParagraphTokenizer->new,
    );

Anybody who exploits this will have to subclass Architecture, though. :\

> Yes, except for the same problem that it causes in English: ‘M. Humphrey’
> becomes two sentences.

Quoting from <http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation>:

    The standard 'vanilla' approach to locate the end of a sentence:

        (a) If it's a period, it ends a sentence.
        (b) If the preceding token is on my hand-compiled list of
            abbreviations, then it doesn't end a sentence.
        (c) If the next token is capitalized, then it ends a sentence.

    This strategy gets about 95% of sentences correct.[2]

We'll have to rely on a heuristic to spot abbreviations rather than a
hand-compiled list, but that algo seems like it ought to be good enough for
highlighting.  If we mess up and mistake an abbreviation for a sentence
boundary, it's not the end of the world.
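
To make the idea concrete, here is a rough sketch in Python (for illustration
only; this is not the Highlighter code).  The abbreviation heuristic is an
assumption made up for the example: a token counts as an abbreviation if it is
a single letter ("M.") or contains an internal period ("e.g.").  The function
names are hypothetical, and the result is a list of (offset, length) pairs in
the spirit of what Find_Sentences() returns as Spans.

```python
import re

# Candidate boundary: a period, some whitespace, then a capitalized token.
CANDIDATE = re.compile(r"\.\s+(?=[A-Z])")


def looks_like_abbreviation(token):
    """Hypothetical heuristic: single letters ("M") and dotted tokens ("e.g")."""
    return len(token) == 1 or "." in token


def find_sentences(text):
    """Return a list of (offset, length) pairs, one per detected sentence."""
    spans = []
    start = 0
    for match in CANDIDATE.finditer(text):
        # Token immediately before the period, without the period itself.
        words = text[start:match.start()].split()
        token = words[-1] if words else ""
        if looks_like_abbreviation(token):
            continue  # rule (b): an abbreviation does not end a sentence
        end = match.start() + 1  # include the period in the sentence
        spans.append((start, end - start))
        start = match.end()  # next sentence begins after the whitespace
    if start < len(text):
        spans.append((start, len(text) - start))
    return spans
```

With this heuristic, "M. Humphrey wrote this. It works." comes back as two
sentences rather than three.  Longer abbreviations such as "Dr." would still
slip through and be treated as sentence ends, which is the kind of mistake
that seems tolerable for highlighting.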

> ‘Delta-encoded’?

It's a compression technique which works very well with certain data patterns
including, as in this case, sorted integer sets: store the differences rather
than the integers, so "9, 18, 27" gets stored as "9, 9, 9".
<http://en.wikipedia.org/wiki/Delta_encoding>
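
A minimal sketch of the technique in Python (illustrative only, with made-up
function names; it says nothing about the actual file format code):

```python
def delta_encode(sorted_ints):
    """Replace each integer with its difference from the previous one."""
    deltas = []
    previous = 0
    for value in sorted_ints:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original integers by keeping a running sum."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values
```

The payoff comes when the deltas are then written with a variable-width
integer encoding: small differences typically need fewer bytes than the large
absolute positions they replace.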

> I would like to make Summarizer value term diversity, so we’ll be  
> left with one. I could make it an option instead.

Your call.  Whatever is best for Summarizer.

> > I'm not entirely satisfied with this approach.  The Span class has been
> > simple up till now -- it *could* have been sent over the network with no
> > problem.  Bloating it up with a reference to the Query/Compiler makes it
> > both less general and less transportable.
> 
> How about $compiler->give_me_the_query_for($span)? (with a better  
> method name, of course.) 

Mmm... that sounds promising.  :)

> Or would that make Compiler too complex,  since it would have to store a
> hash (or equivalent) in addition to its  array of spans?

Compiler implementations needn't modify their state to store the array of
spans, because all the necessary materials are passed in to Highlight_Spans()
via arguments.

Perhaps we could modify the DocVector/InvertedDoc object?

I'd actually like to rename Compiler_Highlight_Spans() to
Compiler_Score_Spans() and to keep it generic, because I have the feeling that
the spans will come in handy for other introspection techniques in addition to
highlighting.  However, I'm not sure that the following code would work:

    my $spans = $compiler->score_spans(
        inverted_doc => $inverted_doc,
        searchable   => $searcher,
        field        => $field,
    );
    $inverted_doc->set_spans($spans);
    for my $span (@$spans) {
        my $atomic_compiler = $inverted_doc->compiler_for_span($span);
        ...
    }

> But I thought queries could be sent over the network.

They can, but we have a serialization graph problem.  We might start off with
a lot of small Span objects all containing a reference to the same Query (e.g.
a TermQuery for a common term which matches several places in the same field)
but once they go over the wire, they lose the reference.  We can serialize the
Query anew with each Span, but that's wasteful -- especially since each Span
is only two 32-bit integers and a float.  The alternative is to design a
special container and handler to transport everything at once and rebuild the
relationships on arrival, but that's a pain and means a bunch of
special-casing.
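
For what it's worth, the "special container" alternative might look roughly
like the Python sketch below.  Everything in it is an assumption made for
illustration: the Span field names, the pack()/unpack() helpers, the use of
JSON as the wire format, and plain dicts standing in for Query objects.  Each
distinct query is written once, and spans refer to it by index so the sharing
can be rebuilt on arrival.

```python
import json
from dataclasses import dataclass


@dataclass
class Span:
    offset: int           # assumed field names; the source only says
    length: int           # "two 32-bit integers and a float"
    weight: float
    query: object = None  # hypothetical shared reference to a query


def pack(spans):
    """Serialize each distinct query once; spans point at it by index."""
    queries = []
    index_of = {}  # id(query) -> position in `queries`
    rows = []
    for span in spans:
        key = id(span.query)
        if key not in index_of:
            index_of[key] = len(queries)
            queries.append(span.query)
        rows.append([span.offset, span.length, span.weight, index_of[key]])
    return json.dumps({"queries": queries, "spans": rows})


def unpack(payload):
    """Rebuild spans so that those sharing an index share one query object."""
    data = json.loads(payload)
    queries = data["queries"]
    return [Span(o, n, w, queries[q]) for o, n, w, q in data["spans"]]
```

This keeps the payload small, but it also shows where the special-casing
creeps in: the spans can no longer be serialized one at a time, because they
only make sense together with the query table that travels alongside them.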

Marvin Humphrey

