lucy-dev mailing list archives

From Nathan Kurz <>
Subject Re: [lucy-dev] Three class name changes
Date Mon, 04 Apr 2011 19:44:41 GMT
Thanks for the write-up.  I think this level of detail is very
helpful, both for discussion now and as a basis for an overview
document in the future.  It provides footholds for those starting with
the code, and I think it also illustrates the complexity of the
current design.

While far from a mess, there is definitely a certain accretive
tendency and quite a few historical artifacts.  Rather than coming up
with exact names for a variety of subtly distinguished objects, I
think we would be better off collapsing some of these extra entities
into something simpler.

You are right that just getting 0.1.0 released is the top priority.
If that is truly the goal, I would suggest _not_ renaming anything.
But if it's important enough to change (and I think it might be), I
would do it right rather than just cosmetically.  And I feel pretty
certain that this goes deeper than just coming up with the right names
for the existing structures.

So I'll focus on a single (contentious) point:  As an object, Compiler
can just go away.  It's a silly optimization for TF/IDF scoring and
thus shouldn't be a primary architectural feature.  It's just a Query
with the boosts filled in.
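To make "just a Query with the boosts filled in" concrete, here is a toy sketch (not Lucy's actual API; all names are hypothetical) of folding IDF-style weights directly into the Query at creation time, so no separate Compiler object is needed:

```python
# Toy model: a "compiled" query is just the original query with
# per-term weights filled in from collection statistics.
import math

class Query:
    def __init__(self, terms, boost=1.0):
        self.terms = terms      # list of term strings
        self.boost = boost      # overall boost factor
        self.weights = {}       # per-term weights, empty until filled in

def fill_in_weights(query, doc_count, doc_freqs):
    """Fold TF/IDF-style IDF weights directly into the Query."""
    for term in query.terms:
        df = doc_freqs.get(term, 0)
        idf = math.log((doc_count + 1) / (df + 1)) + 1.0
        query.weights[term] = idf * query.boost
    return query

q = fill_in_weights(Query(["this", "that"]), doc_count=1000,
                    doc_freqs={"this": 900, "that": 10})
# "that" is the rarer term, so it ends up with the larger weight
```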

I've reordered your comments liberally to respond to them more easily.

And I'm presuming that scaling to multiple machines is the end goal.

> However, I do not wish to pursue MatchEngine right this moment.

OK, that's off limits for now.

> The canonical way to express Lucy search queries is using Query objects.
> There is no syntax which is convenient enough for user search boxes which can
> be used to perform lossless round-trip serialization of arbitrary Queries.

Yes.  One can presumably serialize a Query object to text, but cannot
reliably round-trip from user input to Query and back to user input.

> A Query is a simple container which contains the minimum information necessary
> to define an abstract search query.  It is not specific to any index or
> collection of documents.

Mostly.  How about "is not tied to any specific index"?   If one had a
distributed system, one might convert user input to a query, and then
send that canonical query to workers.  But it does presume a certain
schema for field names, etc.   And why "simple" and "minimal"?

> A Query object cannot perform Highlighting because it does not have access
> to weighting information and thus cannot calculate the score for each snippet.

That sounds easy to fix:  we've got a boost field, let's use it!   If
more than a floating point boost field is needed, let's add those
fields to Query.
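As a toy sketch of that idea (hypothetical names, not Lucy's API): once weighting information lives on the Query itself, a highlighter can score snippets with no Compiler at all, simply by summing the weights of the query terms each snippet contains:

```python
# Toy model: snippet scoring straight from per-term query weights.
def score_snippet(snippet, weights):
    tokens = snippet.lower().split()
    return sum(weights.get(tok, 0.0) for tok in tokens)

# Assumed pre-weighted Query fields:
weights = {"lucy": 2.0, "search": 0.5}
snippets = ["Lucy is a search library", "unrelated text"]
best = max(snippets, key=lambda s: score_snippet(s, weights))
```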

> We should probably formalize the notion that Lucy is not tied to any
> particular query language.

Yes.  Also, there should be nothing magic about the standard parser.
So long as you can create a standard Query object, you can use it.

Foreshadowing:  perhaps different scoring methods should create their
own Query objects.  Or better yet, "boost" existing Queries so that
parsing and boosting are independent.

TF/IDF (I'm not against TF/IDF itself, just against having it define
our architecture) requires access to full collection statistics (via
the Searcher), but can't that weighting be done at Query creation or
just after?

query = new Lucy::Query("this AND that");
Lucy::TFIDF::Boost(query, Searcher);

query = new Custom::Query("este & ese");
Custom::Boost(query, Searcher, IP, flags, whatever);

query = new One::Stop::Boosted::Query("user input", flags, boost_parameters);

The key for me is that there is a standard Query that gets created.
It can be a subclass, of course, but shouldn't be required to be
one.   And once the (optional) boosting is done, it's standalone and
ready to be passed off to a machine that does not have access to the
full corpus statistics.
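The Perl-ish pseudocode above can be sketched as a runnable toy model (all names hypothetical): parsing and boosting are independent steps, the Searcher's statistics are consumed only during boosting, and afterwards the Query is self-contained and serializable for shipment to a remote worker:

```python
import json
import math

def parse(text):
    # Stand-in for any query parser: produce a plain term list.
    return {"terms": text.lower().split(), "weights": {}}

def tfidf_boost(query, searcher):
    # searcher = (doc_count, {term: doc_freq}); stats are used here only.
    doc_count, doc_freqs = searcher
    for t in query["terms"]:
        query["weights"][t] = math.log(
            (doc_count + 1) / (doc_freqs.get(t, 0) + 1)) + 1.0
    return query

searcher = (500, {"este": 400, "ese": 5})
query = tfidf_boost(parse("este ese"), searcher)
wire = json.dumps(query)   # standalone: no Searcher needed downstream
```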

> A Compiler/Investigation is created using the combination of a Query and a
> Searcher.  It inspects the Query and performs weighting based on statistics
> obtained from the Searcher.  Once the weighting is finished, it is ready to
> perform its role as a factory which creates Matchers from supplied SegReaders.

Ditch it.  Or, in keeping with the "no MatchEngine" approach, use it as
an internal class to create a Matcher or MatcherFactory, but never
pass a Compiler object around.  A Matcher can hold a reference to its
parent Query if needed.
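A toy sketch of that arrangement (hypothetical names): the factory role stays behind an internal helper that hands out per-segment Matchers, callers only ever see Queries and Matchers, and each Matcher keeps a back-reference to its parent Query:

```python
# Toy model: per-segment Matchers created by an internal factory,
# with no Compiler object escaping to callers.
class Matcher:
    def __init__(self, query, segment):
        self.parent_query = query     # back-reference, if needed
        self.hits = [doc_id for doc_id, text in segment
                     if any(t in text for t in query["terms"])]

def matchers_for(query, seg_readers):
    """Internal factory: one Matcher per segment."""
    return [Matcher(query, seg) for seg in seg_readers]

segs = [[(0, "this and that"), (1, "neither")],
        [(2, "that one")]]
ms = matchers_for({"terms": ["that"]}, segs)
```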

> The advantage of delegating to Compiler/Investigation is that Highlighter can
> be made to work against arbitrary Query subclasses -- Highlighter itself
> doesn't need to be extended.

Great advantage, let's keep it.  But once Query is pre-weighted this
is solved, right?

> A Matcher is a low-level object which operates against a single segment of an
> index.  It determines which documents within that segment match the Query, and
> optionally, calculates a score expressing how well each document matched.

Remind me:  how does the Matcher get associated with the segment?  I
started to look, but got lost in the twisty paths.  It's magic, right?

> In addition, we have the the summary I wrote above.  It omits IndexReader,
> PolyReader, Collector, and all the details underneath Matcher -- but perhaps
> that is for the best.

The summary above is great, but it needs to be fleshed out.   The
problem is that right now there are lots of dependencies on those
pieces that are "underneath" Matcher, and generally lots of
interdependence between the pieces, which makes swapping out any
single piece really hard.   Documenting this is the first step to
fixing it.

> Right now, Query creates Compiler and Compiler creates Matcher -- but that means
> that in order to plug in an alternate scoring engine, you have to subclass every Query
> class.

Which is clearly insanity, but it is not actually due to
Query->Compiler->Matcher.  Rather, it's the continued interdependence
after each creates the next, in the sense that each builds on the
other only a small amount.  But I think we can solve this quite
easily, at least for TF/IDF, which will continue to be the default
scoring, and that will give us a much easier path forward for
alternative scoring arrangements.

>> Do we really need an Investigator object per se, or just a class that
>> contains some functions for creating a Matcher?
> Exactly the issue that I feel we must postpone :) though I hope it's clear
> from my response above that I'm excited about the possibilities.

Back to pragmatism:  if the goal is to get 0.1.0 out the door, stop
worrying about naming and release it.  But if the goal is to provide a
better foundation, strip out TF/IDF and then put it back in a sane
manner.   I think the sanest manner would be to have it over and done
with after the Query creation phase, so that its tentacles don't
extend throughout the rest of the system.   Thinking in terms of
boosting a Query is a good approach, and I think it can be done with
standard Query fields rather than per-scorer subclasses.

