jackrabbit-dev mailing list archives

From Ian Boston <...@tfd.co.uk>
Subject Re: Jackrabbit Scalability / Performance
Date Sun, 29 Apr 2007 09:41:54 GMT
Bertrand Delacretaz wrote:
> On 4/28/07, Christoph Kiehl <christoph@sulu3000.de> wrote:
> 
>> ...Our current solution is to shut down the
>> repository for a short time, start the RDBMS backup, and copy the
>> index files. When the index file copying is finished we start up the
>> repository again...
> 
> Note that the Lucene-based Solr indexer
> (http://lucene.apache.org/solr/) has a clever way of allowing online
> backups of Lucene indexes, without having to stop anything (or for a
> very short time only).
> 
> In short, it works like this:
> 
> -Solr can be configured to launch a "snapshotter" script at a point in
> time when it's not writing anything to the index.
> 
> -The script takes a snapshot of the index files using hard links
> (won't work on Windows AFAIK), which is very quick on Unixish
> platforms.
> 
> -Solr waits until the script is done (a few milliseconds I guess) and
> resumes indexing.
> 
> -Another asynchronous backup script can then copy the snapshot
> anywhere, from the hard linked files, without disturbing Solr.
> 
> This won't help for the RDBMS part, but implementing something similar
> might help for online backups of index files.
> 
> See http://wiki.apache.org/solr/CollectionDistribution for more
> details - the main goal described there is index replication, but it
> obviously works for backups as well.
> 
> -Bertrand
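
The hard-link snapshot idea described above could be sketched in pure Java along these lines (a minimal, hypothetical sketch, not Solr's actual snapshotter script, which is a shell script; it assumes a filesystem that supports hard links, so not FAT or older Windows setups):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: pause index writes, hard-link every index file into a snapshot
// directory, then resume. Hard links share the same on-disk blocks, so
// this step is near-instant; an asynchronous backup job can then copy
// the snapshot directory at leisure without disturbing the indexer.
public class IndexSnapshot {

    public static Path snapshot(Path indexDir, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    // createLink makes a hard link: no data is copied
                    Files.createLink(snapshotDir.resolve(file.getFileName()), file);
                }
            }
        }
        return snapshotDir;
    }

    public static void main(String[] args) throws IOException {
        Path index = Files.createTempDirectory("index");
        Files.write(index.resolve("segments_1"), "fake segment".getBytes());
        Path snap = snapshot(index, index.resolveSibling(index.getFileName() + "-snap"));
        System.out.println(Files.exists(snap.resolve("segments_1")));
    }
}
```

The only window where the indexer must hold off writing is the loop over the directory listing, which is why the pause Bertrand mentions is only milliseconds.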

Slightly off-thread, but relevant to index backup:

-----
Sakai has been using Lucene to provide search indexes in a cluster. We 
have been using a realtime index distribution mechanism in which all 
nodes can take part in indexing and all nodes can take part in search 
delivery. With minor modifications it could work as an indexing farm and 
a search farm.

It uses a shim just below the index open/close calls that manages 
updates to the cluster's local disks, beneath the IndexReaders and 
IndexWriters.

I looked at Nutch and the Nutch file system at the time, but 
unfortunately we had to reject it because, like Solr, it required Unix 
setup and system commands, and we needed a 100% Java solution that 
worked out of the box.

It doesn't do MapReduce, but it does keep the indexes locally, and all 
the nodes are up and running all the time.

The relevant parts of the code tree can be found at

The index factory

https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/ClusterFSIndexStorage.java

And the distribution management, which puts segments in zipped form in a 
shared location (either a database or a filesystem):

https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/JDBCClusterIndexStore.java
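
The zip-and-publish step could be sketched roughly like this (a hypothetical illustration of the idea, not the actual JDBCClusterIndexStore code; the class and method names here are made up):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch: bundle a local index segment directory into a single zip file
// and drop it in a shared location, so other cluster nodes can detect
// it, download it, and unpack it onto their local disks.
public class SegmentPublisher {

    public static Path zipSegment(Path segmentDir, Path sharedDir) throws IOException {
        Files.createDirectories(sharedDir);
        Path zip = sharedDir.resolve(segmentDir.getFileName() + ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip));
             DirectoryStream<Path> files = Files.newDirectoryStream(segmentDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    out.putNextEntry(new ZipEntry(file.getFileName().toString()));
                    Files.copy(file, out);
                    out.closeEntry();
                }
            }
        }
        return zip;
    }
}
```

A DB-backed variant would stream the same zip bytes into a blob column instead of writing a file, which is roughly what the JDBC store above does.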


It is definitely not a perfect solution, and I can see it needs lots of 
improvement, but it works in production.

If Jackrabbit looks really good in a cluster (which I am expecting), we 
may start putting the indexes directly in Jackrabbit and let it manage 
the distribution; they are not that big in most cases, generally < 10G. 
(The total data set being indexed will go up to 1TB at some universities.)


The main point being, the central location provides a convenient place 
for consistent backups of the index (though perhaps it is overkill).

Ian

