lucene-dev mailing list archives

From "Arno de Quaasteniet" <a...@x-hive.com>
Subject Integration of lucene in XML product
Date Wed, 06 Nov 2002 09:11:01 GMT
Hi everybody,

Lately I've been doing some research into further integration of
Lucene (1.2) with the product (an XML repository) of the company I work
for. I was particularly interested in the following things:
- Storing the indexes in our database instead of in a file system, so
that operations on them would be part of a transaction.
- Integrating full-text search in our XQuery implementation.
- Storing direct references to node objects in the index instead of
document IDs, to improve performance and flexibility in our situation.

I thought it might be interesting to share my experiences with
the Lucene community, and since I have benefited so much from using
Lucene, I have taken some time to write them down.

But first I want to compliment you on the clean source code and
the good use of abstractions. I think the current design makes it very
easy to integrate Lucene in other products.

The first approach I took was replacing the FSDirectory with a custom
directory that stores the indexes in an OODBMS. This was about half a
day's work. I encountered some minor issues while doing this:

* The Directory class is currently an abstract base class with no method
implementations. It would have been easier if this were an
interface, because then it would be possible to directly extend my
directory implementation class from a persistence-capable class in the
object DB system we use. Now another indirection was necessary.

* I found the names of the InputStream and OutputStream classes a bit
misleading, since they do not actually represent "real" streams (at least
not in the same sense as the streams in the java.io package), but
instead offer random access to the underlying store.

* The InputStream class has a seekInternal method. This method is never
used; instead, the underlying implementations use the file pointer that
is kept by the InputStream class and seek automatically in their read
operations. This was a bit misleading. Wouldn't it be better if
InputStream called the seekInternal method?

* Maybe it is a good idea to change the InputStream class to an
interface, and add a BufferedInputStream class that wraps an InputStream
implementation (separation of concerns) and also implements the
InputStream interface. This all makes the contract for implementers of a
store simpler. I'm aware of the performance implications (late binding
vs. methods that can be inlined), but I'm not sure this would be a
real issue, since what really matters in this product is I/O.
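
To make the last suggestion concrete, here is a minimal sketch of such a
split, written in present-day Java. It is not Lucene code: the names
RandomAccessInput, ByteArrayInput and BufferedInput are hypothetical, the
in-memory store merely stands in for a database blob, and error handling
is left out.

```java
/** Hypothetical store contract: positioned reads only, no buffering. */
interface RandomAccessInput {
    /** Reads up to len bytes at the current position; returns the count, or -1 at end of input. */
    int read(byte[] dest, int off, int len);
    void seek(long pos);
    long position();
    long length();
}

/** Unbuffered in-memory store, standing in for a database blob (assumption). */
final class ByteArrayInput implements RandomAccessInput {
    private final byte[] data;
    private long pos = 0;
    int reads = 0; // number of reads that reached the store, for demonstration

    ByteArrayInput(byte[] data) { this.data = data; }

    public int read(byte[] dest, int off, int len) {
        reads++;
        if (pos >= data.length) return -1;
        int n = (int) Math.min(len, data.length - pos);
        System.arraycopy(data, (int) pos, dest, off, n);
        pos += n;
        return n;
    }
    public void seek(long newPos) { pos = newPos; }
    public long position() { return pos; }
    public long length() { return data.length; }
}

/** Buffering decorator: implements the same interface and wraps any store. */
final class BufferedInput implements RandomAccessInput {
    private final RandomAccessInput in;
    private final byte[] buf;
    private long bufStart = 0; // absolute position of buf[0]
    private int bufLen = 0;    // number of valid bytes in buf
    private int bufPos = 0;    // index of the next unread byte in buf

    BufferedInput(RandomAccessInput in, int bufferSize) {
        this.in = in;
        this.buf = new byte[bufferSize];
    }

    public int read(byte[] dest, int off, int len) {
        if (len == 0) return 0;
        if (bufPos >= bufLen && !refill()) return -1;
        int n = Math.min(len, bufLen - bufPos);
        System.arraycopy(buf, bufPos, dest, off, n);
        bufPos += n;
        return n;
    }

    public void seek(long pos) {
        if (pos >= bufStart && pos < bufStart + bufLen) {
            bufPos = (int) (pos - bufStart); // already buffered: no store access
        } else {
            bufStart = pos; // outside the buffer: discard it, refill lazily
            bufLen = 0;
            bufPos = 0;
        }
    }

    public long position() { return bufStart + bufPos; }
    public long length() { return in.length(); }

    /** Loads the next block from the wrapped store; returns false at end of input. */
    private boolean refill() {
        long next = bufStart + bufPos;
        in.seek(next);
        int n = in.read(buf, 0, buf.length);
        if (n <= 0) return false;
        bufStart = next;
        bufLen = n;
        bufPos = 0;
        return true;
    }
}
```

With this shape a store implementer only writes the small unbuffered
class, and a caller wraps it, for example
new BufferedInput(new ByteArrayInput(bytes), 1024). The price is one
virtual call per buffer refill, not one per byte read.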

While this approach quickly gave me a working prototype, I wasn't really
happy with the performance (although this could be improved by improving
the random access methods of the internal blob storage I used). Another
issue was that Lucene does not allow existing indexes to be
updated, and instead always creates new sub-indexes. Since all the other
indexes in our product are live, I also wanted the full-text indexes to
have this same "live" behavior (by live I mean that all changes in an
XML document are reflected directly in the indexes).

So I took another, more drastic approach: I replaced the index and store
packages with code of my own that uses our own Btree indexes:
- Replacing the index package was not completely trivial: although the
abstractions are very clear, they are not always implemented in a way
that makes it easy to replace one component with another, since in
a lot of places abstract classes are used instead of interfaces.
- We do not need the multi-field functionality (this is solved in
XQuery), so I removed that code.
- A number of interfaces have "iterator-like" behavior with a next
method, but in some cases it isn't necessary to call next to see the
first item, and in other cases it is. This was a bit confusing.
- We use long identifiers instead of integers, and our identifiers are
not incremental (within a range they are, but the start point of a range
depends on the physical location of an object). It is also not easy to
find the highest identifier. This was a big issue, because I had to
make a lot of modifications to make this work, especially for the
Boolean scorer: the algorithm used by this class is not at all
suitable for "random" long IDs, since it iterates from 0 to the highest
int ID. And as you know, there is a big difference between the highest
possible int ID and the highest possible long ID. So I replaced this
algorithm with another algorithm using priority queues. If someone's
interested in the algorithm, just let me know.
- I changed the scorer interfaces to only include a next method that
returns a ScoreDoc or null. It's no longer necessary to have a max doc
ID. The scorer just returns null when it is finished (this is only
possible because of the changes I made in the Boolean scorer). Scorers
that do not produce any more results are removed from the priority queue
used by the Boolean scorer (the original Boolean scorer keeps calling
sub-scorers even if they will not produce any more results).
- I encountered some dead code along the way...
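
For readers who want a picture of the kind of scorer contract and
priority-queue merge described in the last two points, the following is a
minimal sketch in present-day Java. It is an illustration under
assumptions, not the code that went into our product and not Lucene's
API: the names ScoredNode, NodeScorer, ArrayScorer and DisjunctionScorer
are hypothetical, only OR-combination is shown, and each sub-scorer is
assumed to return its hits in ascending ID order.

```java
import java.util.List;
import java.util.PriorityQueue;

/** A hit: a long node identifier plus a score. */
final class ScoredNode {
    final long id;
    final float score;
    ScoredNode(long id, float score) { this.id = id; this.score = score; }
}

/** Hypothetical scorer contract: hits in ascending id order, then null. */
interface NodeScorer {
    ScoredNode next();
}

/** Test scorer over a pre-sorted array of ids, all with the same score. */
final class ArrayScorer implements NodeScorer {
    private final long[] ids;
    private final float score;
    private int i = 0;
    ArrayScorer(long[] sortedIds, float score) { this.ids = sortedIds; this.score = score; }
    public ScoredNode next() {
        return i < ids.length ? new ScoredNode(ids[i++], score) : null;
    }
}

/** OR-combination: merges sub-scorers by id, summing the scores of equal ids. */
final class DisjunctionScorer implements NodeScorer {
    /** A sub-scorer together with the hit it is currently positioned on. */
    private static final class Entry {
        final NodeScorer scorer;
        ScoredNode current;
        Entry(NodeScorer scorer, ScoredNode current) {
            this.scorer = scorer;
            this.current = current;
        }
    }

    // Ordered by the id each sub-scorer is positioned on; smallest id first.
    private final PriorityQueue<Entry> queue =
        new PriorityQueue<>((a, b) -> Long.compare(a.current.id, b.current.id));

    DisjunctionScorer(List<NodeScorer> subScorers) {
        for (NodeScorer s : subScorers) {
            ScoredNode first = s.next();
            if (first != null) queue.add(new Entry(s, first)); // empty scorers never enter
        }
    }

    public ScoredNode next() {
        if (queue.isEmpty()) return null; // every sub-scorer is exhausted
        long id = queue.peek().current.id;
        float sum = 0f;
        // Drain every sub-scorer positioned on the smallest id.
        while (!queue.isEmpty() && queue.peek().current.id == id) {
            Entry e = queue.poll();
            sum += e.current.score;
            e.current = e.scorer.next();
            if (e.current != null) queue.add(e); // exhausted scorers are dropped
        }
        return new ScoredNode(id, sum);
    }
}
```

The work done is proportional to the number of hits (times the logarithm
of the number of sub-scorers), not to the size of the ID range, so sparse
long IDs cost nothing extra and no maximum ID is needed.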

Of course there were more issues, but I forgot to keep an exact list of
them. I think these were the most important.

The second approach is the one that will be included in the new version
of our product. I'm quite happy with the end result, and the performance
is also good. The only thing I regret is that I had to change quite
a lot of Lucene code, which makes it hard to upgrade in the future.

Again I want to say I was impressed by the quality of the code, and the
design. Very good! 

I hope this information can be of some use for further improvement of
Lucene. 

Kind regards,

Arno de Quaasteniet
X-Hive Corporation
+31 (0)10 710 86 24
http://www.x-hive.com
arno@x-hive.com

P.S. While testing performance I noticed that the StandardAnalyzer
spends a significant amount of time in a fillInStackTrace method. I
think it uses an exception internally to signal something; replacing
this with a test on a return value will probably speed up the process a
lot.
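
As a generic illustration of that suggestion (not the StandardAnalyzer
code, which I have not traced in detail), the sketch below contrasts the
two ways of signalling end of input. The class names are hypothetical.
The one certain fact it relies on is that constructing a Throwable
normally calls fillInStackTrace(), which walks the call stack.

```java
/** Signals end of input by throwing: every throw pays for fillInStackTrace(). */
final class ThrowingSource {
    private final String text;
    private int pos = 0;
    ThrowingSource(String text) { this.text = text; }
    char next() {
        if (pos >= text.length()) throw new IllegalStateException("read past end of input");
        return text.charAt(pos++);
    }
}

/** Signals end of input with a sentinel value, as java.io.Reader.read() does. */
final class SentinelSource {
    private final String text;
    private int pos = 0;
    SentinelSource(String text) { this.text = text; }
    int next() {
        return pos < text.length() ? text.charAt(pos++) : -1;
    }
}

/** Counts whitespace-separated tokens using either kind of source. */
final class TokenCounter {
    static int count(ThrowingSource src) {
        int tokens = 0;
        boolean inToken = false;
        try {
            while (true) {
                boolean ws = Character.isWhitespace(src.next());
                if (!ws && !inToken) tokens++;
                inToken = !ws;
            }
        } catch (IllegalStateException endOfInput) {
            return tokens; // control flow by exception: one stack trace per input
        }
    }

    static int count(SentinelSource src) {
        int tokens = 0;
        boolean inToken = false;
        int c;
        while ((c = src.next()) != -1) { // plain comparison, no exception object
            boolean ws = Character.isWhitespace((char) c);
            if (!ws && !inToken) tokens++;
            inToken = !ws;
        }
        return tokens;
    }
}
```

Both variants return the same count; when many short texts are analyzed,
the first one creates one exception (and one stack trace) per text, which
is the kind of cost a profiler would attribute to fillInStackTrace.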


 



