lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Flexible indexing
Date Wed, 14 Mar 2007 02:02:29 GMT

On Mar 13, 2007, at 2:38 AM, Michael Busch wrote:

> Global field semantics make our life with FI much easier in a  
> single index. But even with global field semantics we would have  
> the same problem with the IndexWriter.addIndexes() method, no? I'm  
> curious about how you solved that conflict in KinoSearch?

I didn't.

The KinoSearch equivalent of IndexWriter.addIndexes() fails if either  
you attempt to add an index created using a different subclass of  
Schema, or if any mismatches are detected when comparing field name  
=> spec pairings.  No conflict resolution is attempted -- only  
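
As a rough illustration of the fail-fast check described above, here
is a minimal sketch in Java. FieldSpec, SketchSchema, and
checkCompatible are hypothetical names invented for this example; they
are not KinoSearch or Lucene classes, and the sketch assumes that a
field present in only one schema also counts as a mismatch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical per-field spec; not a KinoSearch or Lucene class.
final class FieldSpec {
    final boolean indexed;
    final boolean stored;
    final String analyzerName;

    FieldSpec(boolean indexed, boolean stored, String analyzerName) {
        this.indexed = indexed;
        this.stored = stored;
        this.analyzerName = Objects.requireNonNull(analyzerName);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FieldSpec)) return false;
        FieldSpec f = (FieldSpec) o;
        return indexed == f.indexed && stored == f.stored
                && analyzerName.equals(f.analyzerName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(indexed, stored, analyzerName);
    }
}

// Hypothetical schema: a fixed mapping of field name => spec.
class SketchSchema {
    private final Map<String, FieldSpec> specs = new LinkedHashMap<>();

    SketchSchema add(String name, FieldSpec spec) {
        specs.put(name, spec);
        return this;
    }

    // Detects mismatches and throws; it never tries to reconcile them.
    void checkCompatible(SketchSchema other) {
        if (getClass() != other.getClass()) {
            throw new IllegalArgumentException(
                    "different Schema subclass: " + other.getClass().getName());
        }
        // Assumption: the two name => spec maps must be identical.
        if (!specs.equals(other.specs)) {
            throw new IllegalArgumentException("field name => spec mismatch");
        }
    }
}
```

An add-indexes operation built on this would call checkCompatible()
on each incoming index's schema before touching any segment data, so
a mismatch aborts the whole operation up front.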

By committing to resolving all field property conflicts, Lucene  
creates two problems for itself.

First, there's the burden of writing, maintaining, and using the  
conflict resolution code for each property.  Sometimes this code is  
problematic, as illustrated by a Michael McCandless post to java-user  
from this morning:

   Note, however, that you must do this for all Field instances by that
   same field name because whenever Lucene merges segments, if even one
   Document did not disable norms then this will "spread" so that all
   documents keep their norms, for the same field name.

Second, Lucene limits the kinds of properties that may be attached to  
field names to those where conflict resolution is possible, and which  
may be expressed entirely via a single boolean value.  If you want to  
hang more sophisticated semantics off of field names, it is necessary  
to apply ad-hoc solutions outside the system:  
PerFieldAnalyzerWrapper, subclassing Similarity and making
lengthNorm() polymorphic depending on field name, etc.
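
The norms behavior quoted earlier is one example of a property that
fits in a single boolean and has an obvious resolution rule. A
minimal sketch of that rule, assuming only what the quote states (the
class and method names are hypothetical, and this is not Lucene's
actual merge code):

```java
import java.util.List;

final class NormsMergeSketch {
    // Assumed rule, taken from the quoted description: the merged field
    // omits norms only if every input omitted them. A single input that
    // kept norms makes the merged result keep them too.
    static boolean mergedOmitNorms(List<Boolean> omitNormsPerInput) {
        boolean omit = true;
        for (boolean inputOmits : omitNormsPerInput) {
            omit &= inputOmits;
        }
        return omit;
    }
}
```

A rule like this works because two booleans can always be combined
into one. A richer property, such as which analyzer a field uses, has
no comparable way to merge two conflicting values.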

Things get easier to control, grok, and extend if all per-field  
behaviors are determined by a single class rather than spread out.   
An Analyzer spec can be associated with a field name permanently,  
eliminating analyzer mismatches.  So can a Similarity  
implementation... soon, a posting format.
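
To make the "single class" idea concrete, here is a minimal Java
sketch. FieldBehavior and FieldRegistry are hypothetical names, and
the two functional interfaces are stand-ins for an Analyzer and for
Similarity's lengthNorm(); none of this is the KinoSearch Schema API
or an existing Lucene class.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.IntToDoubleFunction;

// Hypothetical bundle of everything that hangs off one field name.
final class FieldBehavior {
    final Function<String, List<String>> analyzer; // stand-in for an Analyzer
    final IntToDoubleFunction lengthNorm;          // stand-in for lengthNorm()

    FieldBehavior(Function<String, List<String>> analyzer,
                  IntToDoubleFunction lengthNorm) {
        this.analyzer = analyzer;
        this.lengthNorm = lengthNorm;
    }
}

// Hypothetical registry: the one place where per-field behavior lives.
final class FieldRegistry {
    private final Map<String, FieldBehavior> byField = new HashMap<>();

    // The binding is permanent: redefining a field name is rejected.
    void define(String field, FieldBehavior behavior) {
        if (byField.putIfAbsent(field, behavior) != null) {
            throw new IllegalStateException("field already defined: " + field);
        }
    }

    FieldBehavior behaviorFor(String field) {
        FieldBehavior behavior = byField.get(field);
        if (behavior == null) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return behavior;
    }
}
```

Indexing and search code would both look up behaviorFor(fieldName),
so the same analyzer is applied at both ends without a wrapper and
without a Similarity subclass that switches on field name.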

Every feature that accumulates adds to the pressure on Lucene's  
conflict resolution system and acts as a drag on innovation (because  
we are reluctant to complicate the interface further, as Yonik was  
with segOmitNorms).  By trading away a certain amount of flexibility  
with regard to what properties may be hung off of individual field  
values, that pressure is released, and we get a simplified code base  
and increased freedom to hang a greater diversity of properties off  
of individual field names.

Marvin Humphrey
Rectangular Research
