lucene-java-user mailing list archives

From Matthew Hall <mh...@informatics.jax.org>
Subject Re: Using lucene as a database... good idea or bad idea?
Date Tue, 29 Jul 2008 19:28:52 GMT
Yeah, we do the same thing here for indexes of up to 57M documents 
(rows), and that's just one part of our implementation.

It takes quite a bit of wrangling to use Lucene in this manner, but 
we've found it to be utterly worthwhile.  A couple of rough sketches of 
what that wrangling looks like are below.
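
Here's a minimal sketch of the kind of thing we (and, I think, Ian) are 
describing - treating the index as a keyed document store, with the 
larger stored values compressed to save space.  This assumes the current 
Lucene 2.x API and made-up field names ("url" as the key), so adjust to 
taste:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;

public class CrawlStore {

    // Add or replace one crawled page.  The "url" field acts as the
    // primary key: updateDocument() deletes any existing doc with that
    // term and adds the new one in a single call.
    public static void save(Directory dir, String url,
                            Map<String, String> fields) throws IOException {
        // false = append to an existing index
        // (pass true the very first time to create it)
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
        for (Map.Entry<String, String> e : fields.entrySet()) {
            // Store.COMPRESS keeps big stored values (page bodies) smaller on disk
            doc.add(new Field(e.getKey(), e.getValue(),
                              Field.Store.COMPRESS, Field.Index.TOKENIZED));
        }
        writer.updateDocument(new Term("url", url), doc);
        writer.close();
    }

    // Fetch a page back by its key without running a scored search.
    public static Document load(Directory dir, String url) throws IOException {
        IndexReader reader = IndexReader.open(dir);
        try {
            TermDocs td = reader.termDocs(new Term("url", url));
            return td.next() ? reader.document(td.doc()) : null;
        } finally {
            reader.close();
        }
    }
}

updateDocument() gives you the delete/replace semantics John asked 
about, and going through termDocs() avoids paying for scoring when all 
you want back is the stored fields.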

Matt

Ian Lea wrote:
> John
>
>
> I think it's a great idea, and do exactly this to store 5 million+
> documents with info that it takes way too long to get out of our
> Oracle database (think days).  Not as many docs as you are talking
> about, and less data for each doc, but I wouldn't have any concerns
> about scaling.  There are certainly lucene indexes out there bigger
> than what you propose.  You can compress the stored data to save some
> space.  Run times for optimization might get interesting but see
> recent threads for suggestions on that.  And since you are not too
> concerned about performance you may not need to optimize much, or even
> at all.
>
> Of course you need to remember that this is not a DBMS solution in the
> sense of transactions, recovery, etc. but I'm sure you are already
> aware of that.
>
>
> --
> Ian.
>
>
> On Tue, Jul 29, 2008 at 2:53 AM, John Evans <john@jpevans.com> wrote:
>   
>> Hi All,
>>
>> I have successfully used Lucene in the "traditional" way to provide
>> full-text search for various websites.  Now I am tasked with developing a
>> data-store to back a web crawler.  The crawler can be configured to retrieve
>> arbitrary fields from arbitrary pages, so the result is that each document
>> may have a random assortment of fields.  It seems like Lucene may be a
>> natural fit for this scenario since you can obviously add arbitrary fields
>> to each document and you can store the actual data in the database. I've
>> done some research to make sure that it would meet all of our individual
>> requirements (that we can iterate over documents, update (delete/replace)
>> documents, etc.) and everything looks good.  I've also seen a couple of
>> references around the net to other people trying similar things... however,
>> I know it's not meant to be used this way, so I thought I would post here
>> and ask for guidance.  Has anyone done something similar?  Is there any
>> specific reason to think this is a bad idea?
>>
>> The one thing that I am least certain about is how well it will scale.  We
>> may reach the point where we have tens of millions of documents and a high
>> percentage of those documents may be relatively large (10k-50k each).  We
>> actually would NOT be expecting/needing Lucene's normal extremely fast text
>> search times for this, but we would need reasonable times for adding new
>> documents to the index, retrieving documents by ID (for iterating over all
>> documents), optimizing the index after a series of changes, etc.
>>
>> Any advice/input/theories anyone can contribute would be greatly
>> appreciated.
>>
>> Thanks,
>> -
>> John
>>
>>     
>
>   
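
John - on the "retrieving documents by ID / iterating over all documents" 
part of your question: besides the keyed lookup above, you can walk the 
whole index by internal doc number.  Another rough sketch against the 
same 2.x API, using the hypothetical "url" key field from the earlier 
example:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public class CrawlScan {

    // Visit every live document in the index.  Internal doc numbers run
    // from 0 to maxDoc()-1, with holes wherever a document has been
    // deleted or replaced, so skip the deleted slots.
    public static void scanAll(Directory dir) throws IOException {
        IndexReader reader = IndexReader.open(dir);
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) {
                    continue;
                }
                Document doc = reader.document(i);
                String url = doc.get("url");  // hypothetical key field from the earlier sketch
                // ... hand the stored fields off to whatever needs them ...
            }
        } finally {
            reader.close();
        }
    }
}

After a burst of adds/updates we occasionally call IndexWriter.optimize() 
to merge segments back down, but as Ian says, if you aren't chasing query 
speed you can do that rarely or not at all.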

-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

