From: Nathan Kurz
Date: Wed, 13 Apr 2011 21:01:51 -0700
To: lucy-dev@incubator.apache.org
In-Reply-To: <20110413192109.GA30277@rectangular.com>
Subject: Re: [lucy-dev] Exile query weighting code to Siberia

On Wed, Apr 13, 2011 at 12:21 PM, Marvin Humphrey wrote:
> Moving weighting out of the library and into application space would increase
> the complexity of user code, ipso facto:
>
>     my $query = $query_parser->parse($query_string);
> +    $query = $query->weight(searcher => $searcher);
>     my $hits = $searcher->hits(query => $query);

It doesn't have to be in the application level --- I'd be perfectly
happy to have it happen in the query parser, so long as the query
parser was clearly written and self-contained, so that one could
confidently rewrite it to use a different weighting scheme without
full knowledge of everything that happens afterward. Gravy would be
if that query parser was contained in a subclass specific to TFIDF:

my $query = new Lucy::TFIDF::Query($query_string);

> For TF/IDF, queries should *always* be weighted, so if we made this change the
> user would simply become responsible for manually executing a step that Lucy
> performs automatically right now.

Sure, but so long as the rules are clear it isn't that onerous. The
reality is that most new users are going to cut and paste from your
sample program, and so long as the sample includes this line they are
unlikely to go out of their way to remove it.

> I think many users would be surprised and confused if we started requiring
> them to take charge of query weighting.

I might even have to start counting on my toes! ;)

> The proposal makes perfect sense, though, if scoring isn't important to you.

Or if scoring is very important to you. It makes less sense if what
you want is an out-of-the-box no configuration search box for your
text based web site.

> What if Lucy was a boolean matching engine, which you could hack to augment
> with TF/IDF scores?
> What if TF/IDF was an add-on, and all TF/IDF weighting
> code lived outside of core?  What if only a tiny fraction of Lucy's users
> needed to weight their queries?

There's of course the question about what Core means here. I think
TF/IDF should certainly be part of the core distribution, but it would
be great if it could be compartmentalized.

> If all that were true, Lucy's internals could be simplified considerably.  All
> of the weighting code would be gone -- we wouldn't have to think about it in
> either single-node or search-cluster context.  Lucy::Search::Compiler would be
> gone and we would all just pass around Query objects.  Only the TF/IDF weirdos
> would stuff those bizarre calls to $query->weight into their application
> code...

I can't quite tell how much I'm being mocked here. I'm guessing you're
trying your best to express a point of view that you don't quite
share. No offense in either case, though, as I'm sure many things I
suggest are quite deserving of considerable mockery.

Everyone needs their queries to be weighted in some way, even if that
weighting is constant. TF/IDF is a fine and venerable default
weighting, if you happen to be indexing books, or blog posts or
magazine articles. But if you are indexing something like names,
titles, or lists of properties, inverse document frequency doesn't
have the same resonance.

And although it may be largely semantic, I really do like the idea of
passing around a Query rather than a Compiler. Especially if we could
keep the Query as simply a canonical representation of a search
request, and split all the other duties off into their own
well-contained classes.

> If you are browsing through the Lucy code base trying to
> understand how everything fits together -- or trying to implement your own
> matching framework on top of those Query classes -- that's going to make
> things a lot easier.
I do think that simplifying the structure would go a long way toward
making modifications more accessible. Compiler really feels like a
catch-all, and yet it's not even in its own hierarchy. Pop quiz: how
many people on this list know that the code for the TF/IDF-specific
TermCompiler can be found in the file TermQuery.c? And how many of
those think it belongs there?

>> I know this doesn't currently exist, but your MatchEngine and
>> Lucy::Score::TFIDF* hierarchy feels like a good direction to explore.
>
> Groovy.  Though I'm not sure where the TF/IDF code will end up yet, I think
> simplifying the *Query.c files ought to be one of the goals of this
> refactoring round.

Sounds like a great goal to me!

--nate