Message-ID: <8c4e68610807290148h6595611dla895dd22e8af4548@mail.gmail.com>
Date: Tue, 29 Jul 2008 09:48:20 +0100
From: "Ian Lea" <ian.lea@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Using lucene as a database... good idea or bad idea?
In-Reply-To: <6f7ea56f0807281853k1f898f5atc6beaf209e03f609@mail.gmail.com>
References: <6f7ea56f0807281853k1f898f5atc6beaf209e03f609@mail.gmail.com>

John


I think it's a great idea.  I do exactly this to store 5 million+
documents holding info that takes way too long to get out of our
Oracle database (think days).  That is fewer docs than you are talking
about, with less data per doc, but I wouldn't have any concerns
about scaling.  There are certainly Lucene indexes out there bigger
than what you propose.  You can compress the stored data to save some
space.  Run times for optimization might get long, but see
recent threads on this list for suggestions on that.  And since you
are not too concerned about search performance, you may not need to
optimize much, or even at all.
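
To give a feel for the compression point, here is a minimal sketch that
deflates a page body before it is stored and inflates it again on
retrieval.  It uses only java.util.zip from the JDK, not Lucene itself.
The class and method names are made up for illustration, and the sample
text is invented; the resulting bytes would go into whatever binary
stored field your Lucene version supports.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper (not part of Lucene): compresses field text before
// it is stored and restores it after it is read back.
public class StoredFieldCompression {

    // Deflate the UTF-8 bytes of the text at the highest compression level.
    static byte[] compress(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate bytes produced by compress() back into the original text.
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read means bad input.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Invented sample: a repetitive page body a little over 20k long.
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            page.append("<p>crawled paragraph number ").append(i).append("</p>\n");
        }
        String original = page.toString();
        byte[] packed = compress(original);
        System.out.println("original bytes:   " + original.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    " + original.equals(decompress(packed)));
    }
}
```

How much you save depends heavily on the content.  Markup-heavy crawled
pages usually compress well, while short or already-compressed values
may not be worth the CPU cost.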

Of course you need to remember that this is not a DBMS solution in the
sense of transactions, recovery, etc. but I'm sure you are already
aware of that.


--
Ian.


On Tue, Jul 29, 2008 at 2:53 AM, John Evans <john@jpevans.com> wrote:
> Hi All,
>
> I have successfully used Lucene in the "traditional" way to provide
> full-text search for various websites.  Now I am tasked with developing a
> data-store to back a web crawler.  The crawler can be configured to retrieve
> arbitrary fields from arbitrary pages, so the result is that each document
> may have a random assortment of fields.  It seems like Lucene may be a
> natural fit for this scenario since you can obviously add arbitrary fields
> to each document and you can store the actual data in the database. I've
> done some research to make sure that it would meet all of our individual
> requirements (that we can iterate over documents, update (delete/replace)
> documents, etc.) and everything looks good.  I've also seen a couple of
> references around the net to other people trying similar things... however,
> I know it's not meant to be used this way, so I thought I would post here
> and ask for guidance.  Has anyone done something similar?  Is there any
> specific reason to think this is a bad idea?
>
> The one thing that I am least certain about is how well it will scale.  We
> may reach the point where we have tens of millions of documents and a high
> percentage of those documents may be relatively large (10k-50k each).  We
> actually would NOT be expecting/needing Lucene's normal extremely fast text
> search times for this, but we would need reasonable times for adding new
> documents to the index, retrieving documents by ID (for iterating over all
> documents), optimizing the index after a series of changes, etc.
>
> Any advice/input/theories anyone can contribute would be greatly
> appreciated.
>
> Thanks,
> -
> John
>
