From: Nathan Kurz
Date: Tue, 12 Apr 2011 00:09:34 -0700
To: lucy-dev@incubator.apache.org
Subject: Re: [lucy-dev] Refining Query-to-Matcher compilation

On Mon, Apr 11, 2011 at 9:07 PM, Marvin Humphrey wrote:
> On Sun, Apr 10, 2011 at 12:08:05PM -0700, Nathan Kurz wrote:
>> > Query objects may also be weighted in a PolySearcher and then passed down into
>> > a child Searcher.  It is essential that the child Searcher know that weighting
>> > has already been performed and must not be performed again.
>>
>> I feel this is an architectural flaw, and that the correct solution is
>> that weighting should never be performed automatically.
>
> Unfortunately, I don't see how that could work.  Weighting isn't optional.
>
> The process of "weighting" a query under TF/IDF involves weighting subqueries
> using IDF, so that if you search for 'new york', the rare term 'york'
> contributes more towards the score than the common term 'new'.
>
> If you don't do that, your search results aren't going to be as relevant as
> they should be -- you're going to get too much 'new' and not enough 'york'.

Yes. I wasn't trying to say that it shouldn't be weighted, but that
the weighting should be explicit rather than automatic. I was
suggesting that instead of checking whether the weighting has already
been done, we provide a means for the weighting to be done and simply
require that it be used. This just comes from a general desire to make
the code paths as simple and explicit as they can be.

>> Assume there is a machine with a known index schema and a net connection:
>> exactly what do we need to specify over-the-wire to get the results we want?
>
> We need to send over a serialized Query which has already been weighted using
> aggregate statistics for the entire corpus.
>
> Right now, that means the Query must be a Lucy::Search::Compiler object.

This is sad, but a lot of my difficulties might be purely semantic.
I have trouble with Compiler being a subclass of Query, and am only
starting to understand what you meant by "High Level Query" and "Low
Level Query" in some earlier mail. And because of some earlier phrasing
about "serializing the Query" I just wasn't seeing that it was
actually a Compiler. I thought there was yet another entity involved.

I think it's the combination of wrapping a Query and being a Query
that confuses me. So Compiler inherits from Query (and thus is a "low
level query"?), but TermCompiler does not inherit from TermQuery? I
guess it's that I want them to either always be subclasses or never
be, and I'm uneasy about the halfway cases. I feel like Compiler is
trying to do an awful lot of things, few of which are really reflected
in its name or parentage.

And what would a non-TF/IDF specific form of Lucy::Index::Similarity be called?

> What I think you may be missing is that we need ANDCompiler and TermCompiler
> in order to *calculate* the values that you would have us insert into ANDQuery
> and TermQuery.  The complex code that performs TF/IDF weighting has to go
> *somewhere* -- TermCompiler and ANDCompiler are that "somewhere".  Even if we
> were to stop using them as containers, we can't kill them off.

I'm missing a lot, but that one I'm getting. My reference to the
nonexistent "Scorer" is my attempt to find a proper place for it,
where proper is just about anywhere with a clearly delineated
boundary. I know this doesn't currently exist, but your MatchEngine
and Lucy::Score::TFIDF* hierarchy feels like a good direction to
explore.

My latest mental failures have been trying to figure out how to
shoehorn in geographic distance subqueries. Should be simple, right?

--nate
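
For illustration only, here is a minimal Python sketch of the "explicit
weighting" idea from the thread: term weights are computed once from
statistics aggregated over every shard, and the result is what would be
handed to each child searcher. None of this is Lucy's API. The function
names, the shard dictionaries, and the IDF formula (a common
1 + ln(N / (df + 1)) variant) are all assumptions made for the example.

```python
import math


def idf(doc_freq, doc_count):
    """One common IDF variant (an assumption, not necessarily Lucy's formula)."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


def weigh_terms(terms, shard_stats):
    """Explicitly weight terms using statistics summed over all shards.

    `shard_stats` is a hypothetical list of per-shard dictionaries with a
    document count and per-term document frequencies.
    """
    total_docs = sum(shard["doc_count"] for shard in shard_stats)
    weights = {}
    for term in terms:
        total_df = sum(shard["doc_freq"].get(term, 0) for shard in shard_stats)
        weights[term] = idf(total_df, total_docs)
    return weights


# Hypothetical corpus split across two shards: 'new' is common, 'york' is rare.
shards = [
    {"doc_count": 1000, "doc_freq": {"new": 400, "york": 20}},
    {"doc_count": 500, "doc_freq": {"new": 250, "york": 5}},
]

# Weighting is an explicit step performed once, at the top level; the
# resulting weights are what each child searcher would receive, so no child
# has to ask whether weighting has already happened.
weights = weigh_terms(["new", "york"], shards)
print(weights)
```

With these made-up counts the rare term 'york' ends up with a larger
weight than the common term 'new', which is the behaviour described in
the quoted 'new york' example.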