incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: [lucy-dev] Refining Query-to-Matcher compilation
Date Mon, 11 Apr 2011 13:48:44 GMT
On Thu, Apr 7, 2011 at 7:29 PM, Marvin Humphrey <marvin@rectangular.com> wrote:

> When weighting an arbitrarily complex query, we have to allow the scoring
> model the option of having member variables and methods which perform the
> weighting, and we have to allow for the possibility that it will proceed in an
> arbitrary number of stages, requiring gradual modifications to complex
> internal states before collapsing down to a final "weight" -- if it ever does.
>

FYI: while I am not particularly knowledgable on lucy, I wanted to
agree with this. We (lucene-java) seem to be heading in the same
direction so I wanted to share a few things I have found.
Obviously I am not saying any of this is the best at all (especially
not knowing lucy's internals) but it might be some food for thought.

1. Its really important to tease out this scoring from the matching as
much as possible in my opinion, because if you go the route of
supporting bulk postings compression formats (e.g. PFOR), the Matcher
will become significantly more complicated. For example, you can take
a look at our "matching" for bulk postings formats here and see how
complicated it is:
http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene/src/java/org/apache/lucene/search/TermScorer.java.
At first I was terrified when Mike Mccandless implemented it this way,
but upon review of a lot of literature I found reference after
reference to this fact, that supporting these bulk compression formats
introduces significant complexity.

2. While its true most scoring systems have some IDF-ish type concept
of "weight" that can be precomputed up front, some scoring systems
cannot collapse to a single value up front. For this reason I think
its good for this "weight" to be opaque to the similarity
implementation: it might contain 2 or 3 stashed values (some DFR
formulations that use a "tf-itf" like formula that cannot be teased
into a single float up-front), or even contain nothing at all
(constant score?). For lucene-java though, I'm considering the idea of
allowing the opaque weight to expose a "getValue" float of some sort,
basically some estimation of the weight, which would support the query
normalization of tf/idf for example.

3. I am not sure if Lucy supports the concept of "Explanation" (the
ability to explain the score) like Lucene does, but if so, its
important to think about this in parallel: the splitting of Matching
from Scoring applies to Explanations too. In other words, in
lucene-java terms, the responsibility for explaining the score must be
factored out into Similarity too.

Below is a long explanation of the currently the in-progress design
for lucene-java (in case its useful): if you want to see the
work-in-progress have a look at
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/

1. What is "Weight" today in lucene-java' purpose instead is to
"setUp" the scoring:
      a. seek to the term, pass its statistics to the Similarity, get
back an opaque "weight"
      b. create matchers per-segment, passing the similarity and some
context (e.g. this up-front weight and the per-segment IndexReader).
2. the matcher asks the similarity for a "Scorer", I found there are
two types needed to support all lucene queries:
      a. ExactDocScorer: this one must implement 'float score(int
docid, int tf)': used for termquery, exact phrasequery, etc.
      b. SloppyDocScorer: this one must implement 'float score(int
docid, float tf)': used for sloppy phrasequery, spanqueries, etc.
      The justification behind the two is that not only are they doing
different things and might be implemented totally differently, but it
makes it easier to optimize the exact case (e.g. TFIDF can do its
score caching)
3. the similarity is a factory for these two types of scorers: it must
implement a 'createExactDocScorer' and 'createSloppyDocScorer'.
      These factory methods get the context from (1b), which allows
them to do things like pull the per-segment norms or whatever they
need up-front.

So the idea is that the base "Similarity" only supports these 3 methods:
  Weight computeWeight(top-level context);
  ExactDocScorer createExactDocScorer(Weight weight, per-segment context);
  SloppyDocScorer createSloppyDocScorer(Weight weight, per-segment context);

But a specific model (e.g. TFIDF) would implement this low-level
unfriendly API and present a nice subclassable API on top of it for
extension.

For some examples implemented on top of this:
1. lucene's traditional "TF/IDF":
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java
2. Dirichlet LM:
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockLMSimilarity.java
3. BM25: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockBM25Similarity.java
4. DFR I(F)L2: https://issues.apache.org/jira/secure/attachment/12474975/LUCENE-2959_mockdfr.patch

Its messy, its a work in progress, and you can probably see some
limitations: for example as mentioned earlier until we make the weight
opaque, some formulations cannot yet be implemented. Additionally the
explanations refactoring is not yet done.

But you can see how the first two systems are optimized reasonably
well, and i experimented with other optimizations (e.g. the scorer for
TF/IDF is specialized to remove a per-document "if" for the
omitNorms=true/false).

Mime
View raw message