lucene-java-user mailing list archives

From Grant Ingersoll <>
Subject Re: Using lucene as a database... good idea or bad idea?
Date Tue, 29 Jul 2008 12:31:29 GMT
I think the answer is it can be done and probably quite well.  I also  
think it's informative that Nutch does not use Lucene for this  
function, as I understand it, but that shouldn't stop you either.  You  
might also have a look at Apache Jackrabbit, which uses Lucene  
underneath as a content repository.
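
On the point quoted below that stored data can be compressed to save
space: here is a minimal sketch of doing that at the application level
with the JDK's java.util.zip classes before the bytes are handed to the
index. The class name StoredFieldCodec and its methods are hypothetical
and are not part of Lucene; this is only one possible approach, not a
description of Lucene's own stored-field compression.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper (not a Lucene class): packs page text before it is stored. */
public class StoredFieldCodec {

    /** Deflates the UTF-8 bytes of the given text. */
    public static byte[] compress(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream(input.length / 2 + 16);
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    /** Inflates bytes produced by compress() back into the original text. */
    public static String decompress(byte[] packed) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream(packed.length * 4 + 16);
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read means the data is damaged.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws DataFormatException {
        // Build a repetitive stand-in for a crawled page body.
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            page.append("<p>crawled page content line ").append(i).append("</p>\n");
        }
        String body = page.toString();
        byte[] packed = compress(body);
        System.out.println(body.length() + " chars -> " + packed.length + " bytes");
        System.out.println("round trip ok: " + decompress(packed).equals(body));
    }
}
```

The resulting byte array could then be kept as a binary stored field
and inflated again when the document is read back. Whether this beats
whatever compression your Lucene version offers for stored fields is
something to measure on your own data.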


On Jul 29, 2008, at 5:34 AM, Ganesh - yahoo wrote:

> Hello all,
> I am also interested in this. I want to archive the content of the  
> document using Lucene.
> Is it a good idea to use Lucene as a storage engine?
> Regards
> Ganesh
> ----- Original Message ----- From: "Ian Lea" <>
> To: <>
> Sent: Tuesday, July 29, 2008 2:18 PM
> Subject: Re: Using lucene as a database... good idea or bad idea?
>> John
>> I think it's a great idea, and do exactly this to store 5 million+
>> documents with info that it takes way too long to get out of our
>> Oracle database (think days).  Not as many docs as you are talking
>> about, and less data for each doc, but I wouldn't have any concerns
>> about scaling.  There are certainly Lucene indexes out there bigger
>> than what you propose.  You can compress the stored data to save some
>> space.  Run times for optimization might get interesting but see
>> recent threads for suggestions on that.  And since you are not too
>> concerned about performance you may not need to optimize much, or
>> even at all.
>> Of course you need to remember that this is not a DBMS solution in
>> the sense of transactions, recovery, etc. but I'm sure you are
>> already aware of that.
>> --
>> Ian.
>> On Tue, Jul 29, 2008 at 2:53 AM, John Evans <> wrote:
>>> Hi All,
>>> I have successfully used Lucene in the "traditional" way to provide
>>> full-text search for various websites.  Now I am tasked with
>>> developing a data-store to back a web crawler.  The crawler can be
>>> configured to retrieve arbitrary fields from arbitrary pages, so the
>>> result is that each document may have a random assortment of fields.
>>> It seems like Lucene may be a natural fit for this scenario since you
>>> can obviously add arbitrary fields to each document and you can store
>>> the actual data in the database.  I've done some research to make
>>> sure that it would meet all of our individual requirements (that we
>>> can iterate over documents, update (delete/replace) documents, etc.)
>>> and everything looks good.  I've also seen a couple of references
>>> around the net to other people trying similar things... however, I
>>> know it's not meant to be used this way, so I thought I would post
>>> here and ask for guidance.  Has anyone done something similar?  Is
>>> there any specific reason to think this is a bad idea?
>>> The one thing that I am least certain about is how well it will
>>> scale.  We may reach the point where we have tens of millions of
>>> documents and a high percentage of those documents may be relatively
>>> large (10k-50k each).  We actually would NOT be expecting/needing
>>> Lucene's normal extremely fast text search times for this, but we
>>> would need reasonable times for adding new documents to the index,
>>> retrieving documents by ID (for iterating over all documents),
>>> optimizing the index after a series of changes, etc.
>>> Any advice/input/theories anyone can contribute would be greatly
>>> appreciated.
>>> Thanks,
>>> -
>>> John

Grant Ingersoll
