lucene-dev mailing list archives

From "Robert Engels"
Subject RE: Taking a step back
Date Thu, 11 May 2006 17:37:17 GMT
I disagree with that a bit. I have found that certain languages lend
themselves far better to certain file formats (that is, if an operation is
very efficient to perform in a particular language, using a file format that
allows the usage of that operation directly will often lead to much better
performance). This is often true with byte ordering on particular hardware
platforms. That is the whole reason this is an issue. Others can read the
modified UTF; it is just not as efficient for them!

But more importantly, I don't think Lucene (or others) should be "held back"
attempting to adhere to a standardized file format.

Take databases for example. Many available. All use different file formats,
but all can be accessed with (pretty much) standardized SQL (using different

I think Lucene could offer a similar approach at the API level, maybe an
embedded TCP/IP interface / command processor (similar to an HTTP server).

You are always going to have interoperability issues (sometimes even when
using Java, but rarely), so I say dump the burden on the others, and just
make Lucene the best Java search engine possible.

Without starting some sort of flame war, I can't think of any advantages to
not running a Java version of Lucene, but, that is just my opinion. It would
be fairly straightforward to convert all of Lucene to C, and provide a Java
binding, but why???

-----Original Message-----
From: Marvin Humphrey
Sent: Thursday, May 11, 2006 12:08 PM
Subject: Re: Taking a step back

On May 10, 2006, at 8:02 AM, Robert Engels wrote:

> The file format issue, however, is a non-issue. If you want
> interoperability between systems, do it via remote invocation and IIOP,
> or some HTTP interface. This is far easier to control, especially
> through version change cycles - otherwise all platforms need to be
> updated together - which is very hard to do (unless you are using Java
> with WORA!).
> I also don't understand why Lucene doesn't focus on being THE JAVA
> search engine. Anything that detracts from that moving forward should,
> I think, be out of scope.

I really don't relish the prospect that this might degenerate into a
language argument, but I think it falls to me to respond, since the
patch I submitted on Monday opens up a lot of possibilities for interop.

I don't necessarily disagree.

Abandoning all attempts at interop has its advantages.  One
unfortunate albeit unavoidable aspect of Lucene is that it is tightly
bound to its file format.  In a perfect world, the file reading/
writing apparatus would be modular: the index would be read into
memory using a plugin, manipulated, then saved using another plugin.
That doesn't work, obviously, because indexes are commonly too large
to be read into available RAM, and so the I/O stuff is scattered over
the entire library, which makes maintaining compatibility laborious.

However, Lucene has to make some effort to track its file format
definition, so that it may live up to the commitments for backwards-
compatibility codified earlier in this thread.  This is currently
done using the File Formats document (though that document is
incomplete and buggy).  There's not much difference between
supporting the files written by an earlier version of Lucene and
supporting the files written by another implementation of Lucene
which adhere to the same spec.

The only question is whether there are Java-specific optimizations
which are so advantageous that they outweigh the benefits of
interchange.  There is no inherent advantage in using Modified UTF-8
over standard UTF-8, and the UTF-8 code I supplied actually speeds up
Lucene by a couple percent because it simplifies some conditionals --
all of the performance hit comes from using a bytecount as the String
prefix.  I have good reasons to believe that this can go away, not
the least of which is I've actually written a working implementation
in Perl/C which uses bytecounts and I know where all the bottlenecks are.
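
To make the Modified UTF-8 versus standard UTF-8 difference concrete, here is a minimal sketch. It uses the JDK's DataOutputStream.writeUTF as a stand-in for Modified UTF-8; that is an assumption for illustration and not Lucene's own string writer, which is not shown in this thread. The class and method names (Utf8Comparison, modifiedUtf8, standardUtf8) are hypothetical. It prints, for a few sample strings, the Java char count next to the encoded byte count under each encoding, which is the same distinction as a char-count prefix versus a bytecount prefix.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Comparison {

    // Encode with the JDK's Modified UTF-8 (DataOutputStream.writeUTF) and
    // strip the 2-byte length header that writeUTF prepends.
    // Assumption: this stands in for "Modified UTF-8" in general, not for
    // Lucene's actual on-disk string format.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encode with standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Plain ASCII, a NUL character, and U+1D50A (outside the BMP,
        // so it is a surrogate pair in a Java String).
        String[] samples = { "abc", "\u0000", "\uD835\uDD0A" };
        for (String s : samples) {
            System.out.printf("chars=%d modified=%d standard=%d%n",
                    s.length(), modifiedUtf8(s).length, standardUtf8(s).length);
        }
    }
}
```

The two encodings agree on ordinary text and differ only for NUL and for characters outside the Basic Multilingual Plane. A reader in another language that expects standard UTF-8 has to special-case exactly those two situations, and a prefix that counts Java chars rather than bytes forces it to decode the string just to find where it ends.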

There are also advantages to keeping the file format public, both for
Java Lucene and for the larger Apache Lucene project.  Of course
there's the raw usefulness of interchange.  For instance, it
might be nice to whip up a little script in Perl or Ruby which works
with your existing rig -- especially if there's a CPAN module that
offers functionality you need which isn't available yet in Java, or
you'd benefit from a near-instantaneous startup time.

But more important, I'd argue, is that having all implementations
share a common file format means that all the authors have an
amplified interest in coordinating, communicating, and contributing.
Just as learning new languages, programming or natural, broadens an
individual's horizons, so does working out an implementation based on
Lucene's data structures in another language lead to fresh thinking.
The more cross-pollination of ideas from various authors and by
proxy, their extended communities, the more all of the sub-projects
gain and the faster Apache Lucene as a whole advances.

Marvin Humphrey
Rectangular Research

To unsubscribe, e-mail:
For additional commands, e-mail:
