Date: Tue, 16 Mar 2010 14:44:42 -0700
From: Marvin Humphrey
To: lucy-dev@lucene.apache.org
Subject: Re: MoreLikeThisQuery
Message-ID: <20100316214442.GA30856@rectangular.com>

On Tue, Mar 16, 2010 at 04:15:40PM +0100, Nick Wellnhofer wrote:

> What's the easiest way to get to the
> term-document matrix either during or
> after indexing?

I'm not sure what format would be most helpful for you.  Here's code to
iterate over all terms and all postings in all segments for the "content"
field:

    use KinoSearch::Index::PolyReader;

    my $poly_reader = KinoSearch::Index::PolyReader->open(
        index => '/path/to/index',
    );
    my %postings;      # term => array of index-wide doc ids
    my $offset = 0;    # doc id offset of the current segment
    for my $seg_reader ( @{ $poly_reader->seg_readers } ) {
        my $lex_reader
            = $seg_reader->obtain("KinoSearch::Index::LexiconReader");
        my $plist_reader
            = $seg_reader->obtain("KinoSearch::Index::PostingListReader");
        my $lexicon = $lex_reader->lexicon( field => 'content' );
        my $plist   = $plist_reader->posting_list( field => 'content' );
        while ( $lexicon->next ) {
            my $term = $lexicon->get_term;
            warn $term;
            $postings{$term} ||= [];
            my $doc_id_array = $postings{$term};
            $plist->seek($term);
            # Doc ids are segment-local; add the offset to make them
            # index-wide.
            while ( my $seg_doc_id = $plist->next ) {
                push @$doc_id_array, $seg_doc_id + $offset;
            }
        }
        $offset += $seg_reader->doc_max;
    }

Does that at least provide a point of departure?

> I'm not sure clustering really helps here. Suppose that each half of the
> search terms is from one of two clusters both of which are relevant to
> the query. Do you really want to exclude one of the clusters?

The number one goal is to exclude high-value terms which are outliers.  So
long as that is achieved, we will remove many painfully wrong results.

Beyond that, my intuition is that whether we focus on one cluster or allow
multiple clusters is less important.  Focusing on one cluster might improve
precision at the expense of recall -- or it might just hurt recall, I don't
know.  Relevance testing of the kind that Robert is talking about could help
us determine that.

Marvin Humphrey
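
As a possible next step from the %postings hash built by the loop above, here is a rough sketch of turning it into a dense binary term-document matrix. Nothing in it is part of KinoSearch: the sample data, the $doc_max value, and the term_doc_matrix helper are all hypothetical. It also assumes doc ids start at 1, which is consistent with the posting loop above stopping when next returns a false value.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as %postings above:
# term => array ref of index-wide doc ids (assumed to start at 1).
my %postings = (
    apple  => [ 1, 3 ],
    banana => [2],
    cherry => [ 1, 2, 3 ],
);
my $doc_max = 3;    # assumed total number of docs across all segments

# Build a dense binary matrix: one row per term, one column per doc.
# A cell is 1 if the term occurs in that doc, 0 otherwise.
sub term_doc_matrix {
    my ( $postings, $doc_max ) = @_;
    my %matrix;
    for my $term ( keys %$postings ) {
        my @row = (0) x $doc_max;
        $row[ $_ - 1 ] = 1 for @{ $postings->{$term} };
        $matrix{$term} = \@row;
    }
    return \%matrix;
}

my $matrix = term_doc_matrix( \%postings, $doc_max );
for my $term ( sort keys %$matrix ) {
    printf "%-8s %s\n", $term, join( ' ', @{ $matrix->{$term} } );
}
```

A dense matrix like this only makes sense for small indexes; for anything larger, the sparse term => doc ids form that %postings already has is probably the better representation to work from.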