jackrabbit-users mailing list archives

From Alexander Klimetschek <aklim...@adobe.com>
Subject Re: Functionality to store indexes in database with jackrabbit 2.1.2 or upcoming releases.........
Date Mon, 29 Nov 2010 12:06:06 GMT
On 29.11.10 12:54, "Ard Schrijvers" <a.schrijvers@onehippo.com> wrote:

>On Mon, Nov 29, 2010 at 11:26 AM, Alexander Klimetschek
><aklimets@adobe.com> wrote:
>> But what is the use-case for this? Why store a full-text index
>> implementation that is totally unrelated to the DB inside a database
>> that just makes it perform worse, use more disk-space, etc.? It's like
>> implementing a database index by storing it in another database...
>Exactly! But you miss one crucial thing: First of all, the Lucene
>index should be tens of times smaller than it currently is. This is
>possible if we make it better configurable. Secondly, performance
>isn't worse, as the entire Lucene indexes are kept in memory. But the
>crucial part is in scalability: In a clustered setup, you can with
>infinispan (formerly jboss cache) have a replicated in memory Lucene
>index. This means, only one node in the cluster needs to do the
>indexing. The other nodes get it replicated. Now, because it is all in
>memory, 2 or 3 cluster nodes can for example be assigned to now and
>then flush their (new) in memory segments to a database: This is just
>a 'backup' for when the entire cluster goes down. It is not used by
>Lucene, only for bootstrapping when starting the cluster. So, this
>scenario does add lots of potential; Bringing in a new node in the
>cluster is instant. Hibernate with very similar needs as jackrabbit
>uses this technique I just described.
>As a bonus we might get rid of the database persisted changelog (or
>how it is called): This is meant for nodes in a cluster to
>a) Evict their caches
>b) Index new nodes
>(b) is not needed any more as we have index replication.
>(a) could be replaced by jms which seems more natural to me.
>The only drawback is that the current jr lucene impl does not fit the
>InfinispanDirectory (infinispan lucene dir). It is because of the
>multi-index and never re-open setup in jr: It was state of the art
>against lucene 1.4, but now mostly redundant.

Having just one node do the indexing sounds interesting. But I would then
store the index inside the repository itself (as a randomly accessible
binary), so that it works with any persistence manager and the
implementation stays simpler (no need to adapt to the various databases).
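
To make the "index as a randomly accessible binary" idea concrete, here is a
minimal sketch in plain Java. All names in it (BinaryStore,
InMemoryBinaryStore, RepositoryIndexFiles, the "/index/" prefix) are
hypothetical and are not Jackrabbit or Lucene APIs. The assumption is that
the index code only talks to a small binary abstraction, and whatever
persistence manager is configured sits behind it. A real implementation
would more likely extend Lucene's Directory and read from JCR binaries
(javax.jcr.Binary offers positional reads); the in-memory store below
exists only to keep the sketch runnable.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for binaries stored in the repository. */
interface BinaryStore {
    void put(String name, byte[] data);

    /** Returns len bytes starting at offset, without a sequential scan. */
    byte[] read(String name, long offset, int len);

    long length(String name);
}

/** In-memory store, used here only so the sketch can run on its own. */
class InMemoryBinaryStore implements BinaryStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    public void put(String name, byte[] data) {
        blobs.put(name, data.clone());
    }

    public byte[] read(String name, long offset, int len) {
        byte[] data = lookup(name);
        if (offset < 0 || len < 0 || offset + len > data.length) {
            throw new IndexOutOfBoundsException("range outside binary: " + name);
        }
        return Arrays.copyOfRange(data, (int) offset, (int) offset + len);
    }

    public long length(String name) {
        return lookup(name).length;
    }

    private byte[] lookup(String name) {
        byte[] data = blobs.get(name);
        if (data == null) {
            throw new IllegalArgumentException("no such binary: " + name);
        }
        return data;
    }
}

/** Index-side view: knows only BinaryStore, never a specific database. */
class RepositoryIndexFiles {
    private final BinaryStore store;
    private final String prefix; // hypothetical location of the index files

    RepositoryIndexFiles(BinaryStore store, String prefix) {
        this.store = store;
        this.prefix = prefix;
    }

    void writeSegment(String file, byte[] bytes) {
        store.put(prefix + file, bytes);
    }

    byte[] readRange(String file, long offset, int len) {
        return store.read(prefix + file, offset, len);
    }
}
```

The point of the indirection is that swapping the persistence manager only
means swapping the BinaryStore implementation; the index code does not
change.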

We had some plans to do something like this with additional indexes
(calling them "collections") that are created by the application but
stored inside the repository, and implemented with Lucene (especially for
the full-text part).

The idea here is to overcome the problem of the single big index for the
entire repository that is mandated by the JCR spec. You often want indexes
that cover only part of a repository (e.g. /content/siteA), are
asynchronous (not blocking other repository writes), and can more easily
be thrown away, updated, etc. without breaking core repository
functionality.
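
A rough sketch of such a "collection", again in plain Java. The class
ScopedAsyncIndex and its methods are hypothetical, and a simple in-memory
term map stands in for Lucene. It only illustrates the three properties
mentioned above: the index is limited to one subtree, indexing happens on
a background thread so writers are not blocked, and the index can be
dropped without touching repository content.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical "collection": a full-text index limited to one subtree. */
class ScopedAsyncIndex {
    private final String rootPath; // e.g. "/content/siteA"

    // term -> paths of the nodes whose text contains that term
    private final Map<String, Set<String>> postings = new ConcurrentHashMap<>();

    // Single background thread: writers only enqueue work and return.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    ScopedAsyncIndex(String rootPath) {
        this.rootPath = rootPath;
    }

    /** True if the path is the subtree root or lies below it. */
    boolean covers(String path) {
        return path.equals(rootPath) || path.startsWith(rootPath + "/");
    }

    /** Called after a repository write; does not wait for indexing. */
    void nodeSaved(String path, String text) {
        if (!covers(path)) {
            return; // outside this collection's subtree
        }
        worker.submit(() -> {
            for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new ConcurrentSkipListSet<>())
                            .add(path);
                }
            }
        });
    }

    /** Returns a sorted copy of the paths indexed so far for the term. */
    Set<String> search(String term) {
        Set<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return Collections.emptySet();
        }
        return new TreeSet<>(hits);
    }

    /** Throws the index away; repository content is not affected. */
    void clear() {
        postings.clear();
    }

    /** Waits for queued indexing work to finish, then stops the worker. */
    void close() {
        worker.shutdown();
        try {
            worker.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because indexing is asynchronous, a search issued right after a write may
not see that write yet. That is the trade-off for not blocking other
repository writes.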

>Anyway, in due time we need to pick this up at the dev list

Of course.


Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel
