lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Moving SweetSpotSimilarity out of contrib
Date Fri, 05 Sep 2008 10:41:48 GMT

Chris Hostetter wrote:

> : Another important driver is the "out-of-the-box experience".
>
> I honestly have no idea what an OOTB experience for Lucene-Java means ...
> For Solr i understand, For Nutch i understand ... for a java library????

Well... even though it's a "java library", Lucene still has many
defaults.

Sure, Solr has even more, so this is important for Solr too.

Most non-Solr apps built on Lucene will simply use Lucene's defaults,
for lack of knowing any better.

How well such apps then work is what I'm calling the OOTB experience
for Lucene, and I think it's well-defined and important.

Especially spooky is when a publication does an eval of search
libraries, because typically they will eval only the OOTB experience
and won't go looking on our wiki to discover all the tricks.

With IndexWriter we default to flushing by RAM usage (16 MB) not by
buffered doc count, to ConcurrentMergeScheduler, to
LogByteSizeMergePolicy, to compound file format, to a mergeFactor of
10, etc.

IndexSearcher (and also IndexWriter, for lengthNorm) uses
Similarity.getDefault().

QueryParser uses a number of defaults when translating the end user's
search text into all sorts of Query instances.

In 2.3 we made great improvements to OOTB indexing speed, and that's
important.

I think making improvements to OOTB relevance is also important, but I
agree this is much harder to do "in general" since there are so many
differences between the content in apps.

That all being said... I also agree (on closer inspection) it's not
cut and dried that SSS is a good choice for default (what would be the
right default for its "curve"?).
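
To make the "curve" question concrete, here is a minimal, self-contained
sketch in plain Java (no Lucene dependency) that compares the two
length-norm shapes.  It assumes the default norm is 1/sqrt(numTerms) and
that the sweet-spot norm follows the formula given in the
SweetSpotSimilarity javadoc; the class name, method names, and the
min/max/steepness values are hypothetical, chosen only for illustration.

```java
public class LengthNormCurves {

    // Default-style length norm: the shorter the field, the higher the norm.
    public static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Sweet-spot-style length norm: flat at 1.0 for lengths in [min, max],
    // decaying outside that plateau at a rate controlled by steepness.
    // Assumed formula: 1/sqrt(steepness * (|x-min| + |x-max| - (max-min)) + 1)
    public static double sweetSpotLengthNorm(int numTerms, int min, int max,
                                             double steepness) {
        double distance = Math.abs(numTerms - min)
                        + Math.abs(numTerms - max)
                        - (max - min);
        return 1.0 / Math.sqrt(steepness * distance + 1.0);
    }

    public static void main(String[] args) {
        // Hypothetical plateau and steepness; these are exactly the knobs
        // with no obvious "right" default.
        int min = 100;
        int max = 500;
        double steepness = 0.5;

        for (int n : new int[] {10, 100, 300, 500, 1000}) {
            System.out.printf("%5d terms: default=%.4f sweetSpot=%.4f%n",
                    n,
                    defaultLengthNorm(n),
                    sweetSpotLengthNorm(n, min, max, steepness));
        }
    }
}
```

Any field length inside [min, max] gets the same norm, so the result
depends entirely on where the plateau sits relative to the typical
document length of the app's content, which is why a single default is
hard to pick.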

If other OOTB relevance improvements surface with time (eg a good way
to do passage scoring/retrieval or proximity scoring or lexical
affinity) then we should strongly consider them.  Such things always
come with a performance cost, though, so it'll be an interesting
discussion...

> But then we get into that back-compat concern issue.

Well...is Lucene's precise scoring formula guaranteed not to change
between releases?  I assume and hope not.

Just like with indexing, where the precise timing of commits, merges,
and flushes was never "promised": that lack of API promise gave us the
freedom to drastically improve the OOTB indexing speed without
breaking any promises.  We need to keep that same freedom on the
search side.

From our last discussion on back compat, our most powerful weapon is
to NOT make promises when they aren't necessary or could limit future
back compat.

And, if we have a back compat situation that's holding back Lucene's
OOTB adoption by new users, we should think hard about switching the
default to favor new users and making an option to quickly get back to
the old behavior to accommodate existing users.  The recent bug fixes
to StandardTokenizer are such examples.

Mike
