lucene-dev mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Thu, 13 Jan 2005 16:52:29 GMT
It's a good point that the aggregate idf table holds enough information
to do the rewrite()'s.  So MultiSearcher can compute the Weights, which
avoids the need to distribute the aggregate tables to the remote nodes.
It is still necessary to compute them and keep them current under index
updates on the remote nodes, for which a delta-docFreq table still seems
to me to be a good approach.
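
A minimal sketch of how an aggregate docFreq table on the dispatcher could absorb per-node deltas, assuming each remote node reports signed per-term changes after an index update. The class name AggregateDocFreqTable and its methods are hypothetical and not part of Lucene; terms are plain strings for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical dispatcher-side table of document frequencies summed over all nodes. */
class AggregateDocFreqTable {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int maxDoc = 0;

    /**
     * Merge a delta reported by one remote node after an index update.
     * Positive values mean documents were added, negative values mean removed.
     */
    synchronized void applyDelta(Map<String, Integer> termDeltas, int maxDocDelta) {
        for (Map.Entry<String, Integer> e : termDeltas.entrySet()) {
            int updated = docFreq.getOrDefault(e.getKey(), 0) + e.getValue();
            if (updated <= 0) {
                docFreq.remove(e.getKey()); // term no longer occurs on any node
            } else {
                docFreq.put(e.getKey(), updated);
            }
        }
        maxDoc += maxDocDelta;
    }

    /** Aggregate document frequency of a term, or 0 if the term is unknown. */
    synchronized int docFreq(String term) {
        return docFreq.getOrDefault(term, 0);
    }

    /** Total number of documents across all nodes. */
    synchronized int maxDoc() {
        return maxDoc;
    }
}
```

A MultiSearcher-like dispatcher could consult such a table when building Weights, so that only the deltas, not the full tables, travel over the wire.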

I think idf() is necessary for decent scoring / relevance-ranking and so
this is essential to do.  With Paul's observation, one complicating step
has been removed.
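
To illustrate why per-node idf values matter, the sketch below compares idf computed from one node's local statistics with idf computed from aggregate statistics. It assumes the formula log(numDocs / (docFreq + 1)) + 1, which is believed to match DefaultSimilarity.idf() but should be treated as an assumption here; the class name and the document counts are made up.

```java
/** Hypothetical helper comparing local and aggregate idf values. */
class IdfComparison {
    /** Assumed idf formula: log(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Made-up statistics for one term on two nodes of 1000 documents each.
        int dfA = 1;
        int dfB = 499;
        int docsPerNode = 1000;

        double idfA = idf(dfA, docsPerNode);               // node A, local statistics only
        double idfB = idf(dfB, docsPerNode);               // node B, local statistics only
        double idfAll = idf(dfA + dfB, 2 * docsPerNode);   // aggregate statistics

        System.out.println("node A local idf: " + idfA);
        System.out.println("node B local idf: " + idfB);
        System.out.println("aggregate idf:    " + idfAll);
    }
}
```

With these numbers the term gets a much higher idf on node A than on node B, so scores from the two nodes are not comparable unless both use the aggregate value.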

Chuck

  > -----Original Message-----
  > From: Paul Elschot [mailto:paul.elschot@xs4all.nl]
  > Sent: Thursday, January 13, 2005 12:18 AM
  > To: lucene-dev@jakarta.apache.org
  > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
  > Similarity.docFreq() ?
  > 
  > On Thursday 13 January 2005 01:19, Chuck Williams wrote:
  > > I think there is another problem here.  It is currently the Weight
  > > implementations that do rewrite(), which requires access to the index,
  > > not just to the idf's.  E.g., RangeQuery.rewrite() must find the terms
  > > in the index within the range.  So, the Weight cannot be computed in the
  > > MultiSearcher, as it does not have direct access to the remote index.
  > >
  > > This seems to put the viability of the whole approach into question.
  > > The better approach may be to distribute an aggregate docFreq table to
  > > each remote node.  A simple interim step could be to support a callback
  > > to the dispatcher node from docFreq on the remote node, although this
  > > would be gross (remote node calls dispatcher node to get docFreq which
  > > in turn calls all remote nodes to get all their docFreqs and sum them).
  > >
  > > We need an aggregate docFreq table, and it needs to be on the remote
  > > nodes since the Weights cannot be computed until after the Query is
  > > rewritten, which requires access to the index on the remote node.
  > 
  > An alternative is to rewrite to a central cache, which is possible
  > because it contains all terms and their total document frequencies.
  > After that all terms and their weights can be sent to the remote
  > searchers, which can then drop the terms that they don't have.
  > 
  > If it is possible to send a truncated term (or a range) with a centrally
  > determined weight to the remote searcher, this would avoid sending all
  > terms to all remote searchers.
  > In that case the remote searchers might rewrite again to
  > select only the terms they have indexed themselves.
  > 
  > The question then is whether it is possible to send the query extended
  > with weights to the remote searchers. Sounds doable to me.
  > 
  > It's losing simplicity, though. OTOH, with a replicated cache, much the
  > same thing would need to be done remotely.
  > 
  > Regards,
  > Paul Elschot.
  > 
  > P.S. Are you sure it is worthwhile to do this?
  > Term density (and its square root, tf()) varies much more than idf
  > nowadays.
  > 
  > > Chuck
  > >
  > >   > -----Original Message-----
  > >   > From: Wolf Siberski [mailto:siberski@l3s.de]
  > >   > Sent: Wednesday, January 12, 2005 4:08 PM
  > >   > To: Lucene Developers List
  > >   > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
  > >   > Similarity.docFreq() ?
  > >   >
  > >   > Doug Cutting wrote:
  > >   > > Wolf Siberski wrote:
  > >   > >
  > >   > >> Chuck Williams wrote:
  > >   > >>
  > >   > >>> This is a nice solution!  By having MultiSearcher create the Weight, it
  > >   > >>> can pass itself in as the searcher, thereby allowing the correct
  > >   > >>> docFreq() method to be called.  This is similar to what I tried to do
  > >   > >>> with topmostSearcher, but a much better way to do it.
  > >   > >>
  > >   > >> This still wouldn't work for RemoteSearchables, except if you allow
  > >   > >> call-backs from each RemoteSearchable to the MultiSearcher.
  > >   > >
  > >   > > I don't see what callbacks are required.  When the Weight is constructed
  > >   > > it invokes docFreq for each term, which, if RemoteSearchables are
  > >   > > involved, will result in IPC calls to those RemoteSearchables.  Then,
  > >   > > the Weight object is serialized to each RemoteSearchable and a TopDocs
  > >   > > is returned.  Where are the callbacks?  These are only required for
  > >   > > HitCollector-based methods, which are not advised with RemoteSearchable.
  > >   >
  > >   > Yes, I agree. I just wanted to point out that the current Weight
  > >   > implementations need to be modified heavily to introduce the
  > >   > behaviour you describe above. For example, take a look at
  > >   > TermQuery.TermWeight.scorer():
  > >   >     [...]
  > >   >     return new TermScorer(this, termDocs, getSimilarity(searcher),
  > >   >                           reader.norms(term.field()));
  > >   >
  > >   > This typically results in a call to searcher.getSimilarity().
  > >   > In the new context, the searcher would be a MultiSearcher,
  > >   > and to resolve that call at one of the RemoteSearchables, the
  > >   > method getSimilarity() would have to be called remotely on it.
  > >   > In this case, we can change it so that the Weight is provided
  > >   > with the Similarity object before it is serialized and sent
  > >   > to the RemoteSearchables. But I'm not sure if all these cases
  > >   > can be resolved that easily. As you already have pointed out,
  > >   > it won't be possible for HitCollector-related Weights.
  > >   >
  > >   > But, as I said, I still agree fully with the approach.
  > >   >
  > >   >
  > >   >
  > >   >
  > >   > ---------------------------------------------------------------------
  > >   > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > >   > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

