lucene-dev mailing list archives

From: Otis Gospodnetic
Subject: Re: Integration of lucene in XML product
Date: Thu, 14 Nov 2002 04:57:15 GMT

I don't know if I just missed it, or if this email from 6.11. only just
showed up in my inbox.  It would be great if you could add some class
names/file names to some of your comments... as it stands, I am not sure
what some of the comments are referring to. :(


--- Arno de Quaasteniet <> wrote:
> Hi everybody,
> Lately I've been doing some research into further integration of
> Lucene (1.2) with the product (an XML repository) of the company I
> work for. I was particularly interested in the following things:
> - Storing the indexes in our database instead of in a file system, so
> that operations on them would be part of a transaction.
> - Integrating full-text search in our XQuery implementation.
> - Storing direct references to node objects in the index instead of
> document ids, to improve performance and flexibility in our
> situation.
> I thought it might be interesting to share my experiences with the
> Lucene community, and since I have benefited so much from using
> Lucene I have taken some time to write them down.
> But first I want to give my compliments on the clean source code and
> the good use of abstractions. I think the current design makes it
> very easy to integrate Lucene into other products.
> The first approach I took was replacing the FSDirectory with a custom
> directory that stores the indexes in an OODBMS. This was about half a
> day's work. I encountered some minor issues while doing this:
> * The Directory class is currently an abstract base class with no
> method implementations. It would have been easier if this were an
> interface, because then my directory implementation class could
> directly extend a persistence-capable class in the object db system
> we use. Now another indirection was necessary.
> * I found the names of the InputStream and OutputStream classes a bit
> misleading, since they do not actually represent "real" streams (at
> least not in the same sense as the standard Java streams), but
> instead offer random access to the underlying store.
> * The InputStream class has a seekInternal method. This method is
> never used; instead the underlying implementations use the file
> pointer that is kept by the InputStream class and seek automatically
> in their read operations. This was a bit misleading. Wouldn't it be
> better if the InputStream called the seekInternal method?
> * Maybe it is a good idea to change the InputStream class to an
> interface, and add a BufferedInputStream class that wraps an
> InputStream implementation class (separation of concerns) and also
> implements the InputStream interface. All this makes the contract
> for implementers of a store simpler. I'm aware of the performance
> implications (late binding vs. methods that can be inlined), but I'm
> not sure this would be a real issue, since what really matters in
> this product is I/O.
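
To make sure I read the last suggestion right, here is a minimal sketch of
the interface-plus-buffering-wrapper idea. All names (StoreInput,
ByteArrayStoreInput, BufferedStoreInput) are hypothetical and are not
Lucene classes; a real store would probably declare IOException, which is
left out here to keep the sketch short.

```java
/** Hypothetical minimal contract that a store implementer would provide. */
interface StoreInput {
    /** Reads up to len bytes into dest; returns the count read, or -1 at end. */
    int read(byte[] dest, int offset, int len);
    void seek(long pos);
    long position();
}

/** In-memory store, for demonstration only; counts calls to read. */
final class ByteArrayStoreInput implements StoreInput {
    private final byte[] data;
    private int pos = 0;
    int reads = 0;

    ByteArrayStoreInput(byte[] data) { this.data = data; }

    public int read(byte[] dest, int offset, int len) {
        reads++;
        if (pos >= data.length) return -1;
        int n = Math.min(len, data.length - pos);
        System.arraycopy(data, pos, dest, offset, n);
        pos += n;
        return n;
    }

    public void seek(long p) { pos = (int) p; }
    public long position() { return pos; }
}

/** Buffering wrapper: implements the same interface as the store it wraps. */
final class BufferedStoreInput implements StoreInput {
    private final StoreInput delegate;
    private final byte[] buffer;
    private long bufferStart;     // position in the store of buffer[0]
    private int bufferLength = 0; // number of valid bytes in buffer
    private int bufferPos = 0;    // next byte to hand out from buffer

    BufferedStoreInput(StoreInput delegate, int bufferSize) {
        this.delegate = delegate;
        this.buffer = new byte[bufferSize];
        this.bufferStart = delegate.position();
    }

    public int read(byte[] dest, int offset, int len) {
        if (len == 0) return 0;
        if (bufferPos >= bufferLength && !refill()) return -1;
        int n = Math.min(len, bufferLength - bufferPos);
        System.arraycopy(buffer, bufferPos, dest, offset, n);
        bufferPos += n;
        return n;
    }

    public void seek(long pos) {
        if (pos >= bufferStart && pos < bufferStart + bufferLength) {
            bufferPos = (int) (pos - bufferStart); // still inside the buffer
        } else {
            delegate.seek(pos);                    // the only place the store seeks
            bufferStart = pos;
            bufferLength = 0;
            bufferPos = 0;
        }
    }

    public long position() { return bufferStart + bufferPos; }

    // The delegate always sits at bufferStart + bufferLength, so a refill
    // simply continues reading from there.
    private boolean refill() {
        long start = bufferStart + bufferLength;
        int n = delegate.read(buffer, 0, buffer.length);
        if (n <= 0) return false;
        bufferStart = start;
        bufferLength = n;
        bufferPos = 0;
        return true;
    }
}
```

With this split the store only has to implement three small methods, and
the buffering logic lives in one place. Whether the extra virtual call
matters next to the actual I/O would need measuring.
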
> While this approach quickly gave me a working prototype, I wasn't
> really happy with the performance (although this could be improved by
> improving the random access methods of the internal blob storage I
> used). Another thing was that Lucene does not allow existing indexes
> to be updated, and instead always creates new sub-indexes. Since all
> the other indexes in our product are live, I also wanted the
> full-text indexes to have this same "live" behavior (by live I mean
> that all changes in an XML document are reflected directly in the
> indexes).
> So I took another, more drastic approach: I replaced the index and
> store packages with code of my own that uses our own B-tree indexes:
> - Replacing the index package was not completely trivial because,
> although the abstractions are very clear, they are not always
> implemented in a way that makes it easy to replace one component with
> another, since in a lot of places abstract classes are used instead
> of interfaces.
> - We do not need the multi-field functionality (this is solved in
> XQuery), so I removed that code.
> - A number of interfaces have "iterator-like" behavior with a next
> method, but in some cases it isn't necessary to call next to see the
> first item, and in other cases it is. This was a bit confusing.
> - We use long identifiers instead of integers, and our identifiers
> are not incremental (within a range they are, but the start point of
> a range depends on the physical location of an object). It is also
> not easy to find the highest identifier. This was a big issue because
> I had to make a lot of modifications to make this work, especially
> for the boolean scorer: the algorithm used by this class is not at
> all suitable for "random" long ids, since it iterates from 0 to the
> highest int id. And as you know, there is a big difference between
> the highest possible int id and the highest possible long id. So I
> replaced this algorithm with another one using priority queues. If
> anyone is interested in the algorithm, just let me know.
> - I changed the scorer interfaces to include only a next method that
> returns a scoredoc or null. It is no longer necessary to have a max
> doc id. The scorer just returns null when it is done (this is only
> possible because of the changes I made in the boolean scorer).
> Scorers that do not produce any more results are no longer part of
> the priority queue used by the boolean scorer (the original boolean
> scorer keeps calling sub-scorers even if they will not produce any
> more results).
> - I encountered some dead code along the way...
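
For readers trying to picture the priority-queue approach, here is a
minimal sketch of how I imagine it. This is my guess, not Arno's actual
code: every name (ScoredNode, NodeScorer, ArrayScorer, DisjunctionScorer)
is hypothetical, it only covers a plain OR of sub-scorers (no required or
prohibited clauses), and it assumes each sub-scorer yields ids in
ascending order.

```java
import java.util.List;
import java.util.PriorityQueue;

/** A hit: a long node id plus a score. Hypothetical, not a Lucene class. */
final class ScoredNode {
    final long id;
    final float score;
    ScoredNode(long id, float score) { this.id = id; this.score = score; }
}

/** Yields hits in ascending id order and returns null when exhausted. */
interface NodeScorer {
    ScoredNode next();
}

/** Scorer backed by a pre-sorted array of ids, for demonstration only. */
final class ArrayScorer implements NodeScorer {
    private final long[] ids;
    private final float weight;
    private int pos = 0;

    ArrayScorer(long[] sortedIds, float weight) {
        this.ids = sortedIds;
        this.weight = weight;
    }

    public ScoredNode next() {
        return pos < ids.length ? new ScoredNode(ids[pos++], weight) : null;
    }
}

/** OR of sub-scorers: merges by id and sums the scores of equal ids. */
final class DisjunctionScorer implements NodeScorer {
    /** A sub-scorer together with the hit it is currently positioned on. */
    private static final class Entry {
        final NodeScorer scorer;
        ScoredNode current;
        Entry(NodeScorer scorer, ScoredNode current) {
            this.scorer = scorer;
            this.current = current;
        }
    }

    // Ordered by the id each sub-scorer is positioned on, smallest first.
    private final PriorityQueue<Entry> queue =
        new PriorityQueue<>((a, b) -> Long.compare(a.current.id, b.current.id));

    DisjunctionScorer(List<NodeScorer> subScorers) {
        for (NodeScorer s : subScorers) {
            ScoredNode first = s.next();
            if (first != null) queue.add(new Entry(s, first)); // skip empty scorers
        }
    }

    public ScoredNode next() {
        if (queue.isEmpty()) return null; // all sub-scorers are exhausted
        long id = queue.peek().current.id;
        float sum = 0f;
        // Pop every sub-scorer positioned on this id and advance it.
        while (!queue.isEmpty() && queue.peek().current.id == id) {
            Entry e = queue.poll();
            sum += e.current.score;
            e.current = e.scorer.next();
            if (e.current != null) queue.add(e); // exhausted scorers drop out
        }
        return new ScoredNode(id, sum);
    }
}
```

The work done is proportional to the number of hits rather than to the
size of the id range, so sparse long ids are no problem, and no max doc id
is needed.
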
> Of course there were more issues, but I forgot to keep an exact list
> of them. I think these were the most important.
> The second approach is the one that will be included in the new
> version of our product. I'm quite happy with the end result, and the
> performance is also good. The only thing I regret is that I had to
> change quite a lot of Lucene code, making it hard to upgrade in the
> future.
> Again, I want to say I was impressed by the quality of the code and
> the design. Very good!
> I hope this information can be of some use for the further
> improvement of Lucene.
> Kind regards,
> Arno de Quaasteniet
> X-Hive Corporation
> +31 (0)10 710 86 24
> P.S. While testing performance I noticed that the StandardAnalyzer
> spends a significant amount of time in a fillInStackTrace method. I
> think it uses an exception internally to signal something; replacing
> this with a test on a return value will probably speed up the process
> a lot.
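
On the P.S.: constructing a Throwable normally fills in its stack trace,
which would explain time showing up in fillInStackTrace. I have not
checked where the analyzer throws, so the following is only a generic
sketch of the two styles, with hypothetical names (EndOfInput,
ThrowingReader, SentinelReader, TokenCounter) that are not Lucene classes.

```java
/** Hypothetical exception used to signal end of input. */
final class EndOfInput extends RuntimeException {
}

/** Signals end of input by throwing; each throw builds a new stack trace. */
final class ThrowingReader {
    private final String text;
    private int pos = 0;

    ThrowingReader(String text) { this.text = text; }

    char next() {
        if (pos >= text.length()) throw new EndOfInput();
        return text.charAt(pos++);
    }
}

/** Signals end of input with a sentinel return value instead. */
final class SentinelReader {
    static final int EOF = -1;
    private final String text;
    private int pos = 0;

    SentinelReader(String text) { this.text = text; }

    int next() {
        return pos < text.length() ? text.charAt(pos++) : EOF;
    }
}

/** Counts whitespace-separated tokens using either reader. */
final class TokenCounter {
    static int countWithException(String text) {
        ThrowingReader in = new ThrowingReader(text);
        int tokens = 0;
        boolean inToken = false;
        try {
            while (true) {
                boolean ws = Character.isWhitespace(in.next());
                if (!ws && !inToken) tokens++;
                inToken = !ws;
            }
        } catch (EndOfInput e) {
            // Expected: this is how the reader reports the end of the text.
        }
        return tokens;
    }

    static int countWithSentinel(String text) {
        SentinelReader in = new SentinelReader(text);
        int tokens = 0;
        boolean inToken = false;
        for (int c = in.next(); c != SentinelReader.EOF; c = in.next()) {
            boolean ws = Character.isWhitespace(c);
            if (!ws && !inToken) tokens++;
            inToken = !ws;
        }
        return tokens;
    }
}
```

Both variants return the same counts; the second one never constructs an
exception, so it would avoid the stack-trace cost on every short input.
How much that matters in practice would need profiling.
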