lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hasan Diwan" <hasan.di...@gmail.com>
Subject Re: Using lucene as a database... good idea or bad idea?
Date Tue, 29 Jul 2008 04:55:48 GMT
Check the nutch or solr projects, both of which are subprojects of lucene. Feel free to drop
me a line if you should run into difficulties.
Sent via BlackBerry by AT&T

-----Original Message-----
From: "John Evans" <john@jpevans.com>

Date: Mon, 28 Jul 2008 18:53:08 
To: <java-user@lucene.apache.org>
Subject: Using lucene as a database... good idea or bad idea?


Hi All,

I have successfully used Lucene in the "tradtiional" way to provide
full-text search for various websites.  Now I am tasked with developing a
data-store to back a web crawler.  The crawler can be configured to retrieve
arbitrary fields from arbitrary pages, so the result is that each document
may have a random assortment of fields.  It seems like Lucene may be a
natural fit for this scenario since you can obviously add arbitrary fields
to each document and you can store the actually data in the database. I've
done some research to make sure that it would meet all of our individual
requirements (that we can iterate over documents, update (delete/replace)
documents, etc.) and everything looks good.  I've also seen a couple of
references around the net to other people trying similar things... however,
I know it's not meant to be used this way, so I thought I would post here
and ask for guidance?  Has anyone done something similar?  Is there any
specific reason to think this is a bad idea?

The one thing that I am least certain about his how well it will scale.  We
may reach the point where we have tens of millions of documents and a high
percentage of those documents may be relatively large (10k-50k each).  We
actually would NOT be expecting/needing Lucene's normal extreme fast text
search times for this, but we would need reasonable times for adding new
documents to the index, retrieving documents by ID (for iterating over all
documents), optimizing the index after a series of changes, etc.

Any advice/input/theories anyone can contribute would be greatly
appreciated.

Thanks,
-
John

Mime
View raw message