incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Three class name changes
Date Sun, 03 Apr 2011 01:27:58 GMT
On Sat, Apr 02, 2011 at 12:39:10PM -0700, Nathan Kurz wrote:
> I'm almost in favor of Investigation because it's so clunky that we are sure
> to get rid of the class just so we can stop looking at the name. :)

*grin*

> This confusion isn't because one assumes "C Compiler", but rather why any
> Compiler would do this.  

An insightful observation.  

> Why does it create raw highlighting data again?   

Here is the relevant method:

    /** Return an array of Span objects, indicating where in the given field
     * the text that matches the parent Query occurs and how well each snippet
     * matches.  The Span's offset and length are measured in Unicode code
     * points.  
     *
     * The default implementation returns an empty array.       
     * 
     * @param searcher A Searcher.
     * @param doc_vec A DocVector.
     * @param field The name of the field.
     */
    public incremented VArray*
    Highlight_Spans(Compiler *self, Searcher *searcher, 
                    DocVector *doc_vec, const CharBuf *field);

A Query object cannot perform this operation because it does not have access
to weighting information and thus cannot calculate the score for each snippet.

The Highlighter uses the generated Span objects to build a HeatMap, selects
the "hottest" part of the text as an excerpt, and then uses the Spans which
match the excerpt to determine where to insert highlight tags.
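
The Span-to-excerpt flow described above can be sketched roughly as follows.
This is a minimal illustration in Python, not Lucy's actual HeatMap or
Highlighter code: the Span fields, the fixed-width window, and the helper
names (build_heat, hottest_window, highlight) are all assumptions made for
the example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    offset: int    # start position, in code points
    length: int    # extent, in code points
    weight: float  # how well this snippet matched


def build_heat(spans, text_len):
    """Sum the weights of all spans covering each position (hypothetical)."""
    heat = [0.0] * text_len
    for s in spans:
        for i in range(max(s.offset, 0), min(s.offset + s.length, text_len)):
            heat[i] += s.weight
    return heat


def hottest_window(heat, width):
    """Return the start of the first window with the greatest total heat."""
    width = min(width, len(heat))
    total = sum(heat[:width])
    best_start, best = 0, total
    for start in range(1, len(heat) - width + 1):
        # Slide the window one position to the right.
        total += heat[start + width - 1] - heat[start - 1]
        if total > best:
            best_start, best = start, total
    return best_start


def highlight(text, spans, width, pre="<b>", post="</b>"):
    """Pick the hottest excerpt and tag the spans that fall inside it."""
    heat = build_heat(spans, len(text))
    start = hottest_window(heat, width)
    end = start + min(width, len(text))
    inside = sorted(
        (s for s in spans if s.offset >= start and s.offset + s.length <= end),
        key=lambda s: s.offset,
    )
    out, pos = [], start
    for s in inside:
        if s.offset < pos:
            continue  # skip spans overlapping one already emitted
        out.append(text[pos:s.offset])
        out.append(pre + text[s.offset:s.offset + s.length] + post)
        pos = s.offset + s.length
    out.append(text[pos:end])
    return "".join(out)
```

The point of the sketch is that highlight() only ever sees Spans; it needs no
knowledge of which Query subclass produced them.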

The advantage of delegating to Compiler/Investigation is that Highlighter can
be made to work against arbitrary Query subclasses -- Highlighter itself
doesn't need to be extended.

> I think it would help tremendously to refine the high level overview of how
> the classes interact.  Once we are able to discuss that succinctly, I think
> we'll have a much easier time naming the individual classes.   

A Query is a simple container which contains the minimum information necessary
to define an abstract search query.  It is not specific to any index or
collection of documents.  

A Matcher is a low-level object which operates against a single segment of an
index.  It determines which documents within that segment match the Query, and
optionally, calculates a score expressing how well each document matched.

When preparing a Matcher which must calculate scores, it is necessary to
perform weighting based on statistical data about the aggregate collection of
documents.  Performing this weighting allows scores against different segments
to be compared meaningfully.

However, each segment only knows its *own* statistics -- and thus the
combination of a Query and a SegReader does not suffice to create a Matcher.
You also need a Searcher which has access to statistics about the complete
collection of documents.

An IndexSearcher is a Searcher representing one index on a local machine.
Each index is made up of one or more segments, and the IndexSearcher has
access to aggregate statistics representing all the segments combined.

A PolySearcher is a Searcher made up of other Searchers.  It typically
represents a document collection spread out over multiple indexes on multiple
machines.  It aggregates the statistics of all its child Searchers.
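
The way statistics roll up from segments to Searchers can be illustrated with
a small Python sketch. It is a toy model under stated assumptions, not Lucy's
implementation: the Segment class, the in-memory term sets, and the method
names doc_max and doc_freq are chosen for the example, and a real PolySearcher
would talk to remote machines rather than hold local objects.

```python
class Segment:
    """One index segment; each document is modelled as a set of terms."""

    def __init__(self, docs):
        self.docs = [set(d) for d in docs]

    def doc_max(self):
        return len(self.docs)

    def doc_freq(self, term):
        # Number of documents in this segment containing the term.
        return sum(1 for d in self.docs if term in d)


class IndexSearcher:
    """One local index: statistics are summed over its segments."""

    def __init__(self, segments):
        self.segments = list(segments)

    def doc_max(self):
        return sum(s.doc_max() for s in self.segments)

    def doc_freq(self, term):
        return sum(s.doc_freq(term) for s in self.segments)


class PolySearcher:
    """A Searcher of Searchers: statistics are summed over its children."""

    def __init__(self, searchers):
        self.searchers = list(searchers)

    def doc_max(self):
        return sum(s.doc_max() for s in self.searchers)

    def doc_freq(self, term):
        return sum(s.doc_freq(term) for s in self.searchers)
```

Because both Searcher classes answer the same two questions, whatever does
the weighting does not need to care whether the collection is one index or
many.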

A Compiler/Investigation is created using the combination of a Query and a
Searcher.  It inspects the Query and performs weighting based on statistics
obtained from the Searcher.  Once the weighting is finished, it is ready to
perform its role as a factory which creates Matchers from supplied SegReaders.
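
The Query / Searcher / Compiler / Matcher relationship summarized above can be
sketched in Python as follows. This is only an illustration under assumptions:
the class and method names (TermQuery, make_compiler, make_matcher,
StatsSearcher) are hypothetical, segments are plain lists of token lists, and
the Lucene-style idf formula stands in for whatever weighting a real
Compiler would perform.

```python
import math


class StatsSearcher:
    """Hypothetical Searcher exposing collection-wide statistics."""

    def __init__(self, segments):
        self.segments = segments  # each segment: a list of token lists

    def doc_max(self):
        return sum(len(seg) for seg in self.segments)

    def doc_freq(self, term):
        return sum(1 for seg in self.segments for doc in seg if term in doc)


class TermQuery:
    """Index-independent container: it only knows the term."""

    def __init__(self, term):
        self.term = term

    def make_compiler(self, searcher):
        return TermCompiler(self, searcher)


class TermCompiler:
    """Query + Searcher: weights once, then acts as a Matcher factory."""

    def __init__(self, query, searcher):
        self.term = query.term
        df = searcher.doc_freq(query.term)
        # Assumed weighting: idf computed from collection-wide statistics.
        self.weight = 1.0 + math.log(searcher.doc_max() / (1.0 + df))

    def make_matcher(self, segment):
        return TermMatcher(self.term, self.weight, segment)


class TermMatcher:
    """Low-level object operating against a single segment."""

    def __init__(self, term, weight, segment):
        self.term, self.weight, self.segment = term, weight, segment

    def matches(self):
        # Yield (doc_id, score) for each matching document in the segment.
        for doc_id, tokens in enumerate(self.segment):
            tf = tokens.count(self.term)
            if tf:
                yield doc_id, tf * self.weight
```

Since every Matcher made by one compiler shares the same weight, scores from
different segments are on the same scale and can be compared directly.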

> I think it may turn out that some parts (QueryParser[*],
> Compiler/Investigator) are merely procedural details organized into classes,
> rather than real public facing objects (Query, Matcher).   

At some point I would like to pursue the idea of a "MatchEngine" class which
defines the mapping between Query and Matcher.  Right now, Query creates
Compiler and Compiler creates Matcher -- but that means that in order to plug
in an alternate scoring engine, you have to subclass every Query class.
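
One purely hypothetical shape for such a mapping, sketched in Python: a
registry keyed on Query class, so that swapping scoring engines means
building a different registry rather than subclassing each Query.  Nothing
here reflects an existing or planned Lucy API; the MatchEngine methods and
the two factory functions are assumptions made for illustration.

```python
class TermQuery:
    """Index-independent query: just a term."""

    def __init__(self, term):
        self.term = term


class MatchEngine:
    """Hypothetical registry mapping Query classes to Matcher factories."""

    def __init__(self):
        self._factories = {}

    def register(self, query_cls, factory):
        self._factories[query_cls] = factory

    def make_matcher(self, query, segment):
        # Walk the class hierarchy so Query subclasses inherit a mapping.
        for cls in type(query).__mro__:
            if cls in self._factories:
                return self._factories[cls](query, segment)
        raise TypeError("no matcher registered for %s" % type(query).__name__)


def tf_matcher(query, segment):
    """Score each matching document by raw term frequency."""
    return [(i, float(doc.count(query.term)))
            for i, doc in enumerate(segment) if query.term in doc]


def boolean_matcher(query, segment):
    """Alternate engine: every matching document scores 1.0."""
    return [(i, 1.0) for i, doc in enumerate(segment) if query.term in doc]
```

With this arrangement the same TermQuery object can be handed to two engines
that score it differently, and the Query class itself is never touched.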

However, I do not wish to pursue MatchEngine right this moment.  The
0.1.0-incubating release needs to be our top priority.

> Marvin --- can you write up a single page overview of this?

We have something like this already -- the class documentation for Compiler.

We also have this post of yours to the KinoSearch list from 2008, which I
think shows us the kind of documentation you would like to see:

    http://www.rectangular.com/pipermail/kinosearch/2008-February/001443.html

In addition, we have the summary I wrote above.  It omits IndexReader,
PolyReader, Collector, and all the details underneath Matcher -- but perhaps
that is for the best.

My first inclination is to preserve the association of this documentation with
Compiler/Investigation.  Query and Searcher are front-line user classes; it
would be best to avoid cluttering up their docs with advanced material.

But perhaps we should create a dedicated
"how-all-these-search-classes-fit-together" document under Lucy::Docs
instead.

> [*] I think QueryParser illustrates my perspective.  It doesn't really
> matter to me whether there is a QueryParser object, only that there is
> a means to transform a string into a Query.  

We should probably formalize the notion that Lucy is not tied to any
particular query language. 

The canonical way to express Lucy search queries is using Query objects.
There is no syntax which is convenient enough for user search boxes which can
be used to perform lossless round-trip serialization of arbitrary Queries.
This ain't SQL.

> Do we really need an Investigator object per se, or just a class that
> contains some functions for creating a Matcher?

Exactly the issue that I feel we must postpone :) though I hope it's clear
from my response above that I'm excited about the possibilities.

Marvin Humphrey

