Mailing-List: contact lucy-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: lucy-dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Date: Tue, 16 Mar 2010 14:44:42 -0700
To: lucy-dev@lucene.apache.org
Subject: Re: MoreLikeThisQuery
Message-ID: <20100316214442.GA30856@rectangular.com>
References: <20100316051735.GB27885@rectangular.com> <4B9FCBAE.40406@aevum.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4B9FCBAE.40406@aevum.de>
User-Agent: Mutt/1.5.13 (2006-08-11)
From: Marvin Humphrey <marvin@rectangular.com>

On Tue, Mar 16, 2010 at 04:15:40PM +0100, Nick Wellnhofer wrote:
> What's the easiest way to get to the term-document matrix either during or
> after indexing?

I'm not sure what format would be most helpful for you.  Here's code to
iterate over all terms and all postings in all segments for the "content"
field:

  my $poly_reader = KinoSearch::Index::PolyReader->open(
    index => '/path/to/index',
  );
  my %postings;   # term => array of index-wide doc ids
  my $offset = 0; # number of docs in the segments processed so far
  for my $seg_reader ( @{ $poly_reader->seg_readers } ) {
    my $lex_reader = $seg_reader->obtain("KinoSearch::Index::LexiconReader");
    my $plist_reader
      = $seg_reader->obtain("KinoSearch::Index::PostingListReader");
    my $lexicon = $lex_reader->lexicon( field => 'content' );
    my $plist = $plist_reader->posting_list( field => 'content' );
    # Walk every term in this segment's lexicon for the field.
    while ($lexicon->next) {
      my $term = $lexicon->get_term;
      warn $term;
      $postings{$term} ||= [];
      my $doc_id_array = $postings{$term};
      # Position the posting list on the current term, then read its docs.
      $plist->seek($term);
      while (my $seg_doc_id = $plist->next) {
        # Convert the segment-local doc id to an index-wide doc id.
        push @$doc_id_array, $seg_doc_id + $offset;
      }
    }
    $offset += $seg_reader->doc_max;
  }
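
If a row-per-document layout is closer to what you need, the %postings hash
built above can be inverted afterwards.  What follows is only a minimal sketch
in plain Perl: the sample data is made up, and rows_from_postings is a
hypothetical helper name, not part of KinoSearch.  Since the loop above records
only doc ids, the resulting matrix holds term presence, not term frequencies.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as %postings:
# term => array ref of index-wide doc ids.
my %postings = (
    apple  => [ 1, 3 ],
    banana => [ 2, 3 ],
    cherry => [ 3 ],
);

# Invert term => [doc ids] into doc id => { term => 1 }.
# (Hypothetical helper for illustration only.)
sub rows_from_postings {
    my ($postings) = @_;
    my %rows;
    for my $term ( keys %$postings ) {
        $rows{$_}{$term} = 1 for @{ $postings->{$term} };
    }
    return \%rows;
}

my $rows  = rows_from_postings( \%postings );
my @terms = sort keys %postings;

# Print one row per document, one 0/1 column per term in sorted order.
for my $doc_id ( sort { $a <=> $b } keys %$rows ) {
    my @cells = map { $rows->{$doc_id}{$_} ? 1 : 0 } @terms;
    print "$doc_id: @cells\n";
}
```

A hash of hashes keeps the matrix sparse, which matters because most terms
appear in only a small fraction of documents.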

Does that at least provide a point of departure?

> I'm not sure clustering really helps here. Suppose that each half of the
> search terms is from one of two clusters both of which are relevant to
> the query. Do you really want to exclude one of the clusters?

The number one goal is to exclude high-value terms which are outliers.  So  
long as that is achieved, we will remove many painfully wrong results.

Beyond that, my intuition is that whether we focus on one cluster or allow
multiple clusters is less important.  Focusing on one cluster might improve
precision at the expense of recall -- or it might just hurt recall, I don't
know.  Relevance testing of the kind that Robert is talking about could help
us determine that.

Marvin Humphrey