lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcus Herou" <marcus.he...@tailsweep.com>
Subject Re: Using lucene as a database... good idea or bad idea?
Date Sun, 03 Aug 2008 12:49:21 GMT
Hi Don't use JDBM if you want a BTree it is too slow... Been there.

Use Xindice's Filer class: org.apache.xindice.core.filer.Filer and
instantiate a org.apache.xindice.core.filer.BTreeFiler which is a whole lot
faster.

JDBM and Xindice are comparable in reading speed but surely not in writing.

Check out my implementation of the BTree:
http://dev.tailsweep.com/svn/abstractcache/trunk/src/main/java/org/tailsweep/abstractcache/disk/xindice/TreeCache.java

Kindly

//Marcus



On Wed, Jul 30, 2008 at 10:44 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> A possible open source solution using a page based database would be to
> store the documents in http://jdbm.sourceforge.net/ which offers BTree,
> Hash, and raw page based access.  One would use a primary key type of
> persistent ID to lookup the document data from JDBM.
>
> Would be a good Lucene project to implement and I think a good solution for
> Ocean LUCENE-1313.  Storing documents in Lucene is fine but for a realtime
> search index with many documents being deleted a lot of garbage builds up.
> Frequent merging of documents files becomes IO intensive.
>
> Of course one issue with JDBM which I am not sure other SQL page based
> systems do is load individual fields directly from disk rather than load
> the
> entire page into RAM first, then load the pages. Maybe it does not matter.
>
> On Mon, Jul 28, 2008 at 9:53 PM, John Evans <john@jpevans.com> wrote:
>
> > Hi All,
> >
> > I have successfully used Lucene in the "tradtiional" way to provide
> > full-text search for various websites.  Now I am tasked with developing a
> > data-store to back a web crawler.  The crawler can be configured to
> > retrieve
> > arbitrary fields from arbitrary pages, so the result is that each
> document
> > may have a random assortment of fields.  It seems like Lucene may be a
> > natural fit for this scenario since you can obviously add arbitrary
> fields
> > to each document and you can store the actually data in the database.
> I've
> > done some research to make sure that it would meet all of our individual
> > requirements (that we can iterate over documents, update (delete/replace)
> > documents, etc.) and everything looks good.  I've also seen a couple of
> > references around the net to other people trying similar things...
> however,
> > I know it's not meant to be used this way, so I thought I would post here
> > and ask for guidance?  Has anyone done something similar?  Is there any
> > specific reason to think this is a bad idea?
> >
> > The one thing that I am least certain about his how well it will scale.
>  We
> > may reach the point where we have tens of millions of documents and a
> high
> > percentage of those documents may be relatively large (10k-50k each).  We
> > actually would NOT be expecting/needing Lucene's normal extreme fast text
> > search times for this, but we would need reasonable times for adding new
> > documents to the index, retrieving documents by ID (for iterating over
> all
> > documents), optimizing the index after a series of changes, etc.
> >
> > Any advice/input/theories anyone can contribute would be greatly
> > appreciated.
> >
> > Thanks,
> > -
> > John
> >
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message