Date: Mon, 11 Apr 2011 21:07:10 -0700
From: Marvin Humphrey <marvin@rectangular.com>
To: lucy-dev@incubator.apache.org
Subject: Re: [lucy-dev] Refining Query-to-Matcher compilation

On Sun, Apr 10, 2011 at 12:08:05PM -0700, Nathan Kurz wrote:
> > Query objects may also be weighted in a PolySearcher and then passed down into
> > a child Searcher.  It is essential that the child Searcher know that weighting
> > has already been performed and must not be performed again.
> 
> I feel this is an architectural flaw, and that the correct solution is
> that weighting should never be performed automatically.  

Unfortunately, I don't see how that could work.  Weighting isn't optional.

The process of "weighting" a query under TF/IDF involves weighting subqueries
using IDF, so that if you search for 'new york', the rare term 'york'
contributes more towards the score than the common term 'new'.  

If you don't do that, your search results aren't going to be as relevant as
they should be -- you're going to get too much 'new' and not enough 'york'.
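As a rough illustration of the point above (a sketch, not Lucy's actual code), the snippet below computes an IDF weight per term. The formula `1 + ln(total_docs / (1 + doc_freq))` is one common TF/IDF variant and is assumed here; the corpus size and document frequencies are invented for the example.

```python
import math


def idf(doc_freq, total_docs):
    """Inverse document frequency: rarer terms get larger weights.

    Assumed formula (a common TF/IDF variant): 1 + ln(N / (1 + df)).
    """
    return 1.0 + math.log(total_docs / (1.0 + doc_freq))


# Hypothetical corpus statistics, invented for illustration only.
TOTAL_DOCS = 1000
DOC_FREQ = {"new": 400, "york": 20}

# Weight each subquery term by its IDF.
weights = {term: idf(df, TOTAL_DOCS) for term, df in DOC_FREQ.items()}

for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

With these made-up counts, 'york' ends up with a noticeably larger weight than 'new', which is the effect that weighting is meant to produce.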

> It's OK if the default QueryParser does the optimization, but the engine
> should run exactly the Query it's passed.  

If we discount QueryParser's Prune() method (which post-processes the Query in
a certain way but doesn't do performance optimization), QueryParser doesn't do
any optimization at all -- and I think that's the right behavior.  Query
optimization should be left to a later stage, rather than performed in the
context of the parser.

> In the same way, the weighting needs to be independent of the "engine".

OK, I see what you're getting at.  This has been a clarifying discussion.

I disagree.  Instead, I concur with Robert that weighting should be opaque and
specific to the matching engine.

> Assume there is a machine with a known index schema and a net connection:
> exactly what do we need to specify over-the-wire to get the results we want?

We need to send over a serialized Query which has already been weighted using
aggregate statistics for the entire corpus.

Right now, that means the Query must be a Lucy::Search::Compiler object.

> Because the corpus statistics are only known by the parent, to me it makes
> no sense to do the weighting on the child: 

Absolutely.  Weighting must be done at the parent level -- i.e. in the context
of the top-level Searcher.  

The Searcher defines the document collection -- across all segments in all
indexes.

> what information needs to be sent as part of the search request for a
> specific case? We want to search ["this" AND "that"], weighted TF/IDF,
> returning top 10 scores.   What bytes form the Request that we need to send
> to the child?

To see a dump of the Compiler, take a look below my sig.  

(The full request would have num_wanted, etc.)
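To make the shape of such a request concrete, here is a minimal sketch that wraps a pre-weighted query and `num_wanted` into bytes. JSON is used purely for illustration; Lucy has its own serialization, and `build_request`, `parse_request`, and the `query` key are hypothetical names. The weights are copied from the dump at the end of this message.

```python
import json


def build_request(weighted_query, num_wanted):
    """Encode a pre-weighted query plus result count as UTF-8 JSON bytes.

    Hypothetical wire format, for illustration only.
    """
    payload = {"query": weighted_query, "num_wanted": num_wanted}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def parse_request(raw):
    """Decode the bytes on the child side; no re-weighting happens here."""
    return json.loads(raw.decode("utf-8"))


# Pre-weighted 'this AND that', with values taken from the Compiler dump.
weighted_query = {
    "_class": "Lucy::Search::ANDCompiler",
    "children": [
        {"_class": "Lucy::Search::TermCompiler", "field": "content",
         "term": "this", "normalized_weight": 1.09951},
        {"_class": "Lucy::Search::TermCompiler", "field": "content",
         "term": "that", "normalized_weight": 1.89905},
    ],
}

raw = build_request(weighted_query, num_wanted=10)
request = parse_request(raw)
print(len(raw), "bytes;", "num_wanted =", request["num_wanted"])
```

The child only reads the weights it is given; nothing in the payload requires it to know corpus-wide statistics.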

> Presuming we know the full corpus statistics on the parent, I think we
> can just serialize a pre-weighted query, specify the name of a Scorer
> (one that adds subqueries), and that we want only the top 10 results.
> I don't think the child needs to know whether we are using TF/IDF,
> TF/IFC, or BM25.   What am I missing?

In a sense, you're not missing anything. :)

I believe that for those three scoring models, you are correct that it's
possible to encode the required information using standard ANDQuery and
TermQuery objects.

And furthermore, I understand that because we can technically use ANDQuery and
TermQuery as containers instead of ANDCompiler and TermCompiler, you would
like us to eliminate ANDCompiler and TermCompiler, simplifying the code base.

We can't.

What I think you may be missing is that we need ANDCompiler and TermCompiler
in order to *calculate* the values that you would have us insert into ANDQuery
and TermQuery.  The complex code that performs TF/IDF weighting has to go
*somewhere* -- TermCompiler and ANDCompiler are that "somewhere".  Even if we
were to stop using them as containers, we can't kill them off.

Marvin Humphrey


#----------------------------------------------------------------------------
# This is a dump of a Compiler for the query string 'this AND that', weighted
# using the US Constitution corpus.
#----------------------------------------------------------------------------

$VAR1 = {
  '_class' => 'Lucy::Search::ANDCompiler',
  'boost' => '1',
  'children' => [
    {
      '_class' => 'Lucy::Search::TermCompiler',
      'boost' => '1',
      'idf' => '1.81575',
      'normalized_weight' => '1.09951',
      'parent' => {
        '_class' => 'Lucy::Search::TermQuery',
        'boost' => '1',
        'field' => 'content',
        'term' => 'this'
      },
      'query_norm_factor' => '0.333494',
      'raw_weight' => '1.81575',
      'sim' => {
        '_class' => 'Lucy::Index::Similarity'
      }
    },
    {
      '_class' => 'Lucy::Search::TermCompiler',
      'boost' => '1',
      'idf' => '2.38629',
      'normalized_weight' => '1.89905',
      'parent' => {
        '_class' => 'Lucy::Search::TermQuery',
        'boost' => '1',
        'field' => 'content',
        'term' => 'that'
      },
      'query_norm_factor' => '0.333494',
      'raw_weight' => '2.38629',
      'sim' => {
        '_class' => 'Lucy::Index::Similarity'
      }
    }
  ],
  'parent' => {
    '_class' => 'Lucy::Search::ANDQuery',
    'boost' => '1',
    'children' => [
      {
        '_class' => 'Lucy::Search::TermQuery',
        'boost' => '1',
        'field' => 'content',
        'term' => 'this'
      },
      {
        '_class' => 'Lucy::Search::TermQuery',
        'boost' => '1',
        'field' => 'content',
        'term' => 'that'
      }
    ]
  },
  'sim' => {
    '_class' => 'Lucy::Index::Similarity'
  }
};
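
The numbers in the dump above appear consistent with a classic TF/IDF query normalization, which the following sketch reproduces. The relationships used here (`raw_weight = idf * boost`, `query_norm_factor = 1 / sqrt(sum of squared raw weights)`, `normalized_weight = raw_weight * idf * query_norm_factor`) are assumptions inferred from the dumped values, not a statement of how the Compiler classes are implemented.

```python
import math

# idf values copied from the dump; boost is 1 for both terms.
IDF = {"this": 1.81575, "that": 2.38629}
BOOST = 1.0

# Assumed: each term's raw weight is its idf scaled by the boost.
raw_weight = {term: value * BOOST for term, value in IDF.items()}

# Assumed: the query norm factor is the inverse Euclidean length
# of the vector of raw weights.
sum_of_squares = sum(w * w for w in raw_weight.values())
query_norm_factor = 1.0 / math.sqrt(sum_of_squares)

# Assumed: the normalized weight applies idf once more, then the norm.
normalized_weight = {
    term: raw_weight[term] * IDF[term] * query_norm_factor for term in IDF
}

print(f"query_norm_factor: {query_norm_factor:.6f}")
for term, weight in normalized_weight.items():
    print(f"{term}: {weight:.5f}")
```

Under these assumptions the computed values agree with the dumped `query_norm_factor` and `normalized_weight` fields to within rounding of the printed idf values.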