lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Taking a step back
Date Thu, 11 May 2006 17:08:01 GMT

On May 10, 2006, at 8:02 AM, Robert Engels wrote:

> The file format issue however is a non-issue. If you want
> interoperability between systems do it via remote invocation and IIOP,
> or some HTTP interface. This is far easier to control, especially
> through version change cycles - otherwise all platforms need to be
> updated together - which is very hard to do (unless you are using Java
> with WORA !).
> I also don't understand why Lucene doesn't focus on being THE JAVA
> search engine. Anything I think that detracts that from moving forward
> should be out of scope.

I really don't relish the prospect that this might degenerate into a  
language argument, but I think it falls to me to respond, since the  
patch I submitted on Monday opens up a lot of possibilities for interop.

I don't necessarily disagree.

Abandoning all attempts at interop has its advantages.  One  
unfortunate albeit unavoidable aspect of Lucene is that it is tightly  
bound to its file format.  In a perfect world, the file reading/ 
writing apparatus would be modular: the index would be read into  
memory using a plugin, manipulated, then saved using another plugin.   
That doesn't work, obviously, because indexes are commonly too large  
to be read into available RAM, and so the I/O stuff is scattered over  
the entire library, which makes maintaining compatibility laborious.

However, Lucene has to make some effort to track its file format  
definition, so that it may live up to the commitments for backwards- 
compatibility codified earlier in this thread.  This is currently  
done using the File Formats document (though that document is  
incomplete and buggy).  There's not much difference between  
supporting the files written by an earlier version of Lucene and  
supporting the files written by another implementation of Lucene  
which adhere to the same spec.

The only question is whether there are Java-specific optimizations  
which are so advantageous that they outweigh the benefits of  
interchange.  There is no inherent advantage in using Modified UTF-8  
over standard UTF-8, and the UTF-8 code I supplied actually speeds up  
Lucene by a couple percent because it simplifies some conditionals --  
all of the performance hit comes from using a bytecount as the String  
prefix.  I have good reasons to believe that this can go away, not  
the least of which is I've actually written a working implementation  
in Perl/C which uses bytecounts and I know where all the bottlenecks are.
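
To make the encoding difference concrete, here is a minimal sketch in
Java. It is not code from the patch: the class and method names are
invented for illustration, and it assumes that
DataOutputStream.writeUTF is a fair stand-in for Modified UTF-8. It
encodes one string both ways and prints the UTF-16 char count next to
the two byte counts, since the choice of String prefix comes down to
which of those numbers gets written.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class Utf8Compare {

    // Standard UTF-8: NUL is one byte, a supplementary code point is four bytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF: NUL becomes
    // the two bytes C0 80, and each half of a surrogate pair is encoded
    // separately as three bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // writeUTF emits a two-byte length header first; drop it.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        // 'a', NUL, and U+1D11E (a code point outside the BMP).
        String s = "a\u0000\uD834\uDD1E";
        System.out.println("UTF-16 chars:   " + s.length());              // 4
        System.out.println("standard UTF-8: " + standardUtf8(s).length);  // 6
        System.out.println("modified UTF-8: " + modifiedUtf8(s).length);  // 9
    }
}
```

For plain ASCII text all three numbers are the same. They diverge only
for NUL, for non-ASCII characters, and for code points outside the
Basic Multilingual Plane.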

There are also advantages to keeping the file format public, both for  
Java Lucene and for the larger Apache Lucene project.  Of course  
there's the raw usefulness of interchange.  For instance, it  
might be nice to whip up a little script in Perl or Ruby which works  
with your existing rig -- especially if there's a CPAN module that  
offers functionality you need which isn't available yet in Java, or  
you'd benefit from a near-instantaneous startup time.

But more important, I'd argue, is that having all implementations  
share a common file format means that all the authors have an  
amplified interest in coordinating, communicating, and contributing.   
Just as learning new languages, programming or natural, broadens an  
individual's horizons, so does working out an implementation based on  
Lucene's data structures in another language lead to fresh thinking.   
The more cross-pollination of ideas from various authors and, by  
proxy, their extended communities, the more all of the sub-projects  
gain and the faster Apache Lucene as a whole advances.

Marvin Humphrey
Rectangular Research
