lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: [lucy-dev] Refining Query-to-Matcher compilation
Date Sun, 10 Apr 2011 20:01:03 GMT
On Sat, Apr 09, 2011 at 11:59:11PM -0700, Nathan Kurz wrote:
> I hadn't seen TF/ICF before:
> I don't yet understand what it's doing differently than TF/IDF.  Is it
> that it's counting the number of documents that use a term rather than
> the number of term occurrences?

I haven't studied TF/ICF as a system in depth, but the difference between IDF
and ICF seems easy to understand.  It rests on the distinction between
"document frequency" and "corpus frequency".  

  document frequency: The number of documents that a term occurs in.
  corpus frequency:   The number of times a term occurs in the corpus.
Say that you have two documents: "foo bar" and "foo foo foo bar".  The
document frequency for "foo" is 2, but its corpus frequency is 4.

At present, we only track document frequency, but it's at least easy in theory
to track corpus frequency.  If we were to weight terms based on ICF rather
than IDF, we still need to know numbers for the entire collection (represented
by a Searcher) and not just for a particular segment (represented by a
> Are there other scoring methods that you anticipate as useful? 

I've long wanted to work up an RTree implementation.

> What other corpus-wide data they would require?  What other corpus wide data
> exists?

I'm not sure.

> > When weighting an arbitrarily complex query, we have to allow the scoring
> > model the option of having member variables and methods which perform the
> > weighting, and we have to allow for the possibility that it will proceed in an
> > arbitrary number of stages, requiring gradual modifications to complex
> > internal states before collapsing down to a final "weight" -- if it ever does.
> Does your "if ever" imply that we indeed should try to support scorers
> that might return additional information beyond a single float, such
> as field name, position data, or matched string?  

That's the gist.  

We start in the top-level Searcher with a Query object, either supplied
directly by the user or implicitly parsed from their supplied query string.
We then weight the Query, producing a new Query with its "weighted" attribute
set to "true".

If the top-level Searcher is a single-index IndexSearcher, then the weighted
Query is used to generate one Matcher for each segment.

If the top-level Searcher is a PolySearcher, the weighted Query is passed down
to each child Searcher.  Eventually, you drill down to an IndexSearcher where
the weighted Query is used to generate Matchers, one per segment.

The weighted Query may serve as a vessel for arbitrary information augmenting
the original Query.  It would be artificially constraining for us to limit the
auxiliary data to a single float.

> I'd like to be able to do this, but don't see an easy framework.

We're doing it now -- the Compiler object is our weighted Query.

> Also, do you feel a Scorer needs to be able to do "incremental"
> scoring, or is it OK if scoring is only possible after a Matcher has
> finished?  

(Nit: we don't have a Scorer class any more -- Matcher replaced it.)

At present, the mechanism we use to collect hits only supports "incremental"
scoring, but an alternative mechanism is theoretically possible if we extend
Searcher and/or Matcher and/or Collector.

> Essentially, will it ever be necessary to score a subquery so that a Matcher
> can decide whether to skip to the next document?

I'm not sure I grok completely, but I think the answer is yes.  

Term frequency for each doc id is stored right alongside the doc id in the
same file.  Once you move beyond the doc id, you've "lost" the opportunity to
use its associated term frequency.

Say you're searching for "foo AND bar".  You can't see whether both "foo" and
"bar" match before calculating a score.  

Instead, you have to see whether "foo" matches and calculate a score
incidentally, then see whether "bar" matches and calculate a score
incidentally, and then if both match, you calculate an aggregate score using
the scores for the "foo" and "bar" subqueries.

Marvin Humphrey

View raw message