Date: Mon, 11 Apr 2011 21:07:10 -0700
From: Marvin Humphrey
To: lucy-dev@incubator.apache.org
Message-ID: <20110412040710.GA32546@rectangular.com>
Subject: Re: [lucy-dev] Refining Query-to-Matcher compilation

On Sun, Apr 10, 2011
at 12:08:05PM -0700, Nathan Kurz wrote:
> > Query objects may also be weighted in a PolySearcher and then passed down into
> > a child Searcher. It is essential that the child Searcher know that weighting
> > has already been performed and must not be performed again.
>
> I feel this is an architectural flaw, and that the correct solution is
> that weighting should never be performed automatically.

Unfortunately, I don't see how that could work. Weighting isn't optional.

The process of "weighting" a query under TF/IDF involves weighting subqueries
using IDF, so that if you search for 'new york', the rare term 'york'
contributes more towards the score than the common term 'new'. If you don't
do that, your search results aren't going to be as relevant as they should be
-- you're going to get too much 'new' and not enough 'york'.

> It's OK if the default QueryParser does the optimization, but the engine
> should run exactly the Query it's passed.

If we discount QueryParser's Prune() method (which post-processes the Query in
a certain way but doesn't do performance optimization), QueryParser doesn't do
any optimization at all -- and I think that's the right behavior. Query
optimization should be left to a later stage, rather than performed in the
context of the parser.

> In the same way, the weighting needs to be independent of the "engine".

OK, I see what you're getting at. This has been a clarifying discussion.

I disagree. Instead, I concur with Robert that weighting should be opaque and
specific to the matching engine.

> Assume there is a machine with a known index schema and a net connection:
> exactly what do we need to specify over-the-wire to get the results we want?

We need to send over a serialized Query which has already been weighted using
aggregate statistics for the entire corpus.

Right now, that means the Query must be a Lucy::Search::Compiler object.
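
As an illustration of the two points above (IDF makes the rare term count for more, and the weights must come from statistics aggregated over the whole corpus), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `idf` formula is the classic Lucene-style one rather than something quoted from Lucy's source, the per-child document counts are invented, and `aggregate` is a hypothetical helper, not a Lucy API.

```python
import math


def idf(doc_freq: int, total_docs: int) -> float:
    """Classic Lucene-style IDF (an assumption, not taken from Lucy's source)."""
    return 1.0 + math.log(total_docs / (1.0 + doc_freq))


# Hypothetical statistics reported by two child searchers.
children = [
    {"total_docs": 1000, "doc_freq": {"new": 400, "york": 5}},
    {"total_docs": 9000, "doc_freq": {"new": 3000, "york": 400}},
]


def aggregate(children):
    """Sum document counts and per-term document frequencies across children."""
    total_docs = sum(child["total_docs"] for child in children)
    doc_freqs = {}
    for child in children:
        for term, freq in child["doc_freq"].items():
            doc_freqs[term] = doc_freqs.get(term, 0) + freq
    return total_docs, doc_freqs


# Weighting at the parent level: one IDF per term for the whole collection.
total_docs, doc_freqs = aggregate(children)
parent_weights = {term: idf(freq, total_docs) for term, freq in doc_freqs.items()}

# For contrast: what each child would compute from its local statistics alone.
local_weights = [
    {term: idf(freq, child["total_docs"]) for term, freq in child["doc_freq"].items()}
    for child in children
]

if __name__ == "__main__":
    print("parent:", parent_weights)
    for i, weights in enumerate(local_weights):
        print(f"child {i}:", weights)
```

With these made-up numbers, 'york' receives a larger weight than 'new' at the parent, and the two children disagree with each other about the weight of 'york' when left to their local statistics, which is why scores would not be comparable across children if each one weighted the query itself.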
> Because the corpus statistics are only known by the parent, to me it makes
> no sense to do the weighting on the child:

Absolutely. Weighting must be done at the parent level -- i.e. in the context
of the top-level Searcher. The Searcher defines the document collection --
across all segments in all indexes.

> what information needs to be sent as part of the search request for a
> specific case? We want to search ["this" AND "that"], weighted TF/IDF,
> returning top 10 scores. What bytes form the Request that we need to send
> to the child?

To see a dump of the Compiler, take a look below my sig. (The full request
would have num_wanted, etc.)

> Presuming we know the full corpus statistics on the parent, I think we
> can just serialize a pre-weighted query, specify the name of a Scorer
> (one that adds subqueries), and that we want only the top 10 results.
> I don't think the child needs to know whether we are using TF/IDF,
> TF/IFC, or BM25. What am I missing?

In a sense, you're not missing anything. :)

I believe that for those three scoring models, you are correct that it's
possible to encode the required information using standard ANDQuery and
TermQuery objects. And furthermore, I understand that because we can
technically use ANDQuery and TermQuery as containers instead of ANDCompiler
and TermCompiler, you would like us to eliminate ANDCompiler and TermCompiler,
simplifying the code base.

We can't. What I think you may be missing is that we need ANDCompiler and
TermCompiler in order to *calculate* the values that you would have us insert
into ANDQuery and TermQuery. The complex code that performs TF/IDF weighting
has to go *somewhere* -- TermCompiler and ANDCompiler are that "somewhere".
Even if we were to stop using them as containers, we can't kill them off.

Marvin Humphrey

#----------------------------------------------------------------------------
# This is a dump of a Compiler for the query string 'this AND that', weighted
# using the US Constitution corpus.
#----------------------------------------------------------------------------

$VAR1 = {
  '_class' => 'Lucy::Search::ANDCompiler',
  'boost' => '1',
  'children' => [
    {
      '_class' => 'Lucy::Search::TermCompiler',
      'boost' => '1',
      'idf' => '1.81575',
      'normalized_weight' => '1.09951',
      'parent' => {
        '_class' => 'Lucy::Search::TermQuery',
        'boost' => '1',
        'field' => 'content',
        'term' => 'this'
      },
      'query_norm_factor' => '0.333494',
      'raw_weight' => '1.81575',
      'sim' => {
        '_class' => 'Lucy::Index::Similarity'
      }
    },
    {
      '_class' => 'Lucy::Search::TermCompiler',
      'boost' => '1',
      'idf' => '2.38629',
      'normalized_weight' => '1.89905',
      'parent' => {
        '_class' => 'Lucy::Search::TermQuery',
        'boost' => '1',
        'field' => 'content',
        'term' => 'that'
      },
      'query_norm_factor' => '0.333494',
      'raw_weight' => '2.38629',
      'sim' => {
        '_class' => 'Lucy::Index::Similarity'
      }
    }
  ],
  'parent' => {
    '_class' => 'Lucy::Search::ANDQuery',
    'boost' => '1',
    'children' => [
      {
        '_class' => 'Lucy::Search::TermQuery',
        'boost' => '1',
        'field' => 'content',
        'term' => 'this'
      },
      {
        '_class' => 'Lucy::Search::TermQuery',
        'boost' => '1',
        'field' => 'content',
        'term' => 'that'
      }
    ]
  },
  'sim' => {
    '_class' => 'Lucy::Index::Similarity'
  }
};
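
The figures in the dump above appear to be consistent with classic Lucene-style TF/IDF query normalization. The Python sketch below starts from the two `idf` values and the boosts shown in the dump and recomputes `raw_weight`, `query_norm_factor` and `normalized_weight`. The formulas are an assumption inferred from the numbers, not taken from Lucy's source, and the function names are made up for illustration.

```python
import math

# idf and boost values copied from the dump; everything else is recomputed.
terms = {
    "this": {"idf": 1.81575, "boost": 1.0},
    "that": {"idf": 2.38629, "boost": 1.0},
}


def raw_weight(term: dict) -> float:
    """Assumed formula: raw weight is idf scaled by the term's boost."""
    return term["idf"] * term["boost"]


# Assumed formula: the norm factor is the inverse square root of the sum of
# squared raw weights over all subqueries (the enclosing AND has boost 1).
sum_of_squares = sum(raw_weight(term) ** 2 for term in terms.values())
query_norm_factor = 1.0 / math.sqrt(sum_of_squares)


def normalized_weight(term: dict) -> float:
    """Assumed formula: raw weight times idf times the shared norm factor."""
    return raw_weight(term) * term["idf"] * query_norm_factor


if __name__ == "__main__":
    print(f"query_norm_factor = {query_norm_factor:.6f}")
    for name, term in terms.items():
        print(f"{name}: normalized_weight = {normalized_weight(term):.5f}")
```

Note that `query_norm_factor` depends on every subquery at once, so it cannot be computed for one term in isolation. That fits the point made in the message: something has to walk the whole query tree with corpus-wide statistics in hand to produce these values, even if plain query objects could carry the results afterwards.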