incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: MoreLikeThisQuery
Date Tue, 16 Mar 2010 21:44:42 GMT
On Tue, Mar 16, 2010 at 04:15:40PM +0100, Nick Wellnhofer wrote:
> What's the easiest way to get to the term-document matrix either during or
> after indexing?

I'm not sure what format would be most helpful for you.  Here's code to
iterate over all terms and all postings in all segments for the "content"
field:

  my $poly_reader = KinoSearch::Index::PolyReader->open( 
    index => '/path/to/index',
  );
  my %postings;    # term => array of index-wide doc ids
  my $offset = 0;  # doc id offset of the current segment
  for my $seg_reader ( @{ $poly_reader->seg_readers } ) { 
    my $lex_reader = $seg_reader->obtain("KinoSearch::Index::LexiconReader");
    my $plist_reader
      = $seg_reader->obtain("KinoSearch::Index::PostingListReader");
    my $lexicon = $lex_reader->lexicon( field => 'content' );
    my $plist = $plist_reader->posting_list( field => 'content' );
    while ($lexicon->next) {
      my $term = $lexicon->get_term;
      warn $term;
      # Position the posting list on the current term before iterating.
      $plist->seek($term);
      $postings{$term} ||= []; 
      my $doc_id_array = $postings{$term};
      while (my $seg_doc_id = $plist->next) {
        # Doc ids are segment-local; add the offset to make them index-wide.
        push @$doc_id_array, $seg_doc_id + $offset;
      }
    }
    $offset += $seg_reader->doc_max;
  }
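
If an actual matrix is what you're after, the %postings hash can be reshaped
into a sparse one in plain Perl.  This is only a rough sketch: the sample data
and the build_matrix() helper are made up for illustration and are not part
of KinoSearch, and since the loop above collects only doc ids, the cells
record presence rather than term frequency.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as %postings:
# term => array of index-wide doc ids.
my %postings = (
    apple  => [ 1, 3 ],
    banana => [ 2 ],
    cherry => [ 1, 2, 3 ],
);

# Build a sparse binary term-document matrix as a hash of hashes:
# $matrix->{$term}{$doc_id} is 1 when the term occurs in the document.
sub build_matrix {
    my ($postings) = @_;
    my %matrix;
    for my $term ( keys %$postings ) {
        $matrix{$term}{$_} = 1 for @{ $postings->{$term} };
    }
    return \%matrix;
}

my $matrix = build_matrix( \%postings );

# The document frequency of a term is the number of cells set in its row.
my %doc_freq = map { $_ => scalar keys %{ $matrix->{$_} } } keys %$matrix;
```

A hash of hashes keeps the matrix sparse, which matters because most terms
occur in only a small fraction of documents.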

Does that at least provide a point of departure?

> I'm not sure clustering really helps here. Suppose that each half of the
> search terms is from one of two clusters both of which are relevant to
> the query. Do you really want to exclude one of the clusters?

The number one goal is to exclude high-value terms which are outliers.  So  
long as that is achieved, we will remove many painfully wrong results.

Beyond that, my intuition is that whether we focus on one cluster or allow
multiple clusters is less important.  Focusing on one cluster might improve
precision at the expense of recall -- or it might just hurt recall, I don't
know.  Relevance testing of the kind that Robert is talking about could help
us determine that.

Marvin Humphrey
