lucene-dev mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Tue, 11 Jan 2005 19:01:14 GMT
I agree that this bug is important to fix, but don't believe we have a
solid fix yet.  Getting idf normalization right is essential for large
distributed-index apps.  I have a client evaluating Lucene for this now.
As Wolf does, I hope a committer with deep knowledge of Lucene's design
in this area will weigh in on the issue and help to resolve it.

I've read through Wolf's patch and see a few issues (please correct
anything wrong here):
  1.  DfMapSimilarity works only with a limited set of queries.  A
complete solution should support all Query types, and certainly must
support fundamental Query types like RangeQuery.  Could this be
addressed by using primitive queries rather than surface queries (i.e.,
after rewriting)?  There may be a more fundamental issue for Queries
that generate large numbers of clauses, because it is very inefficient
to access all the RemoteSearchables for each Term.
  2.  The patch hardwires the use of DfMapSimilarity into MultiSearcher.
As Wolf points out in his comments, this needs to be configurable.  At
present, it would be impossible to use a custom Similarity, e.g. to
change the numerical computation of idf() from the docFreq.  The ability
to configure custom Similarities needs to be robust in the presence of
MultiSearcher, i.e. an application should be able to make the kinds of
changes currently made in a subclass of DefaultSimilarity while
inheriting the behavior that makes it work properly with MultiSearcher.
  3.  Philosophically, I'm not convinced that Similarities are the right
solution.  Similarities are currently used for application-specific
scoring customizations.  The issue here is idf-normalization in the
presence of multiple searchers, which should be an orthogonal
consideration.
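
To make point 3 concrete, here is a minimal sketch of keeping the two
concerns apart: where the docFreq comes from (a local index or global
sums) and how idf is computed from it.  All names below (IdfFormula,
PluggableIdf) are hypothetical and are not part of Lucene or of either
patch; the log-based formula is only assumed to resemble what
DefaultSimilarity does.

```java
import java.util.function.ToIntFunction;

// Hypothetical types for illustration; none of these exist in Lucene.
interface IdfFormula {
    float idf(int docFreq, int numDocs);
}

final class PluggableIdf {
    // Log-based formula of the kind DefaultSimilarity uses (an assumption here).
    static final IdfFormula LOG_IDF =
        (docFreq, numDocs) -> (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);

    private final ToIntFunction<String> docFreqSource; // local index or global sums
    private final int numDocs;                         // must match the same scope
    private final IdfFormula formula;                  // application-specific part

    PluggableIdf(ToIntFunction<String> docFreqSource, int numDocs, IdfFormula formula) {
        this.docFreqSource = docFreqSource;
        this.numDocs = numDocs;
        this.formula = formula;
    }

    // The formula never needs to know whether the statistics are local or global.
    float idf(String term) {
        return formula.idf(docFreqSource.applyAsInt(term), numDocs);
    }
}
```

With this split, an application could swap the formula without touching
the code that supplies multi-searcher statistics, and vice versa.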

My patch with a topmostSearcher field also has issues, especially the
fatal problem that it doesn't work for RemoteSearchables.

A burning question for me is, what is the right solution for
RemoteSearchables?  With Wolf's patch, the MultiSearcher analyzes each
Query to identify the terms it uses and then calls each RemoteSearchable
to get the docFreqs from its index, sums them, extends the Query with a
Map of these sums (within a created Similarity), and then passes this
information back to the RemoteSearchables to use during their scoring.
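
The summing step described above can be sketched roughly as follows.
This is not the code from Wolf's patch: DocFreqSource is a hypothetical
stand-in for a sub-searcher, and terms are plain strings rather than
Lucene Term objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a (remote) sub-searcher; not Lucene's Searchable.
interface DocFreqSource {
    int docFreq(String term);
}

final class QueryDocFreqAggregator {
    // Sums per-term document frequencies over all sources for one query's terms.
    static Map<String, Integer> sumDocFreqs(Set<String> queryTerms,
                                            List<DocFreqSource> sources) {
        Map<String, Integer> sums = new HashMap<>();
        for (String term : queryTerms) {
            int total = 0;
            for (DocFreqSource source : sources) {
                total += source.docFreq(term); // one call per term per source
            }
            sums.put(term, total);
        }
        return sums;
    }
}
```

The nested loop makes the cost visible: the number of calls is the
number of terms times the number of sources, which is why queries that
expand to many clauses are a concern when each call is remote.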

An alternative approach would be to precompute the docFreq sums and
distribute them to all the RemoteSearchables ahead of time, independent
of any Query.  Incremental indexing would need to recompute and
propagate the revised sums.  Having the sums pre-distributed would make
Query-processing efficient.  Is something along those lines possible?
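
As a rough illustration of this alternative, each searcher could hold a
table of global sums that is updated by deltas as indexes change.  The
class below is purely hypothetical (nothing like it exists in Lucene as
far as this thread shows), and it ignores how the deltas would actually
be shipped to the RemoteSearchables.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class: a global docFreq table that each searcher keeps locally.
final class GlobalDocFreqTable {
    private final Map<String, Integer> sums = new HashMap<>();
    private int numDocs;

    // Merges one index's statistics, or a delta after incremental indexing.
    // Negative values can represent deleted documents.
    void applyDelta(Map<String, Integer> docFreqDelta, int numDocsDelta) {
        for (Map.Entry<String, Integer> e : docFreqDelta.entrySet()) {
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        numDocs += numDocsDelta;
    }

    // Lookups at query time are local; no remote call is needed.
    int docFreq(String term) {
        return sums.getOrDefault(term, 0);
    }

    int numDocs() {
        return numDocs;
    }
}
```

The trade-off is that query processing becomes cheap while every index
update has to be propagated to all holders of the table.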

Chuck

  > -----Original Message-----
  > From: Wolf Siberski [mailto:siberski@l3s.de]
  > Sent: Tuesday, January 11, 2005 12:55 AM
  > To: Lucene Developers List
  > Subject: How to proceed with Bug 31841 - MultiSearcher problems with
  > Similarity.docFreq() ?
  > 
  > As I'm very interested in resolving this bug,
  > I would like to resume the discussion about it.
  > Chuck Williams (the original bug reporter) and I
  > have both already provided a patch. Are any of the
  > committers willing to review them?
  > If changes are necessary, or another way of handling
  > this issue turns out to be more appropriate, I would
  > gladly put more work into that area.
  > But I need the support of (at least) one committer, and
  > IMHO some additional discussion about how to tackle
  > that issue wouldn't hurt either.
  > 
  > --Wolf
  > 
  > 
  > bugzilla@apache.org wrote:
  > > DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
  > > RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
  > > <http://issues.apache.org/bugzilla/show_bug.cgi?id=31841>.
  > > ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
  > > INSERTED IN THE BUG DATABASE.
  > >
  > > http://issues.apache.org/bugzilla/show_bug.cgi?id=31841
  > >
  > >
  > > daniel.naber@t-online.de changed:
  > >
  > >            What    |Removed                     |Added
  > > ----------------------------------------------------------------------------
  > >                  CC|                            |feigao@sohu-inc.com
  > >
  > > ------- Additional Comments From daniel.naber@t-online.de  2005-01-04 23:49 -------
  > > *** Bug 32053 has been marked as a duplicate of this bug. ***
  > >
  >
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org