Message-ID: <8c4e68610807290148h6595611dla895dd22e8af4548@mail.gmail.com>
Date: Tue, 29 Jul 2008 09:48:20 +0100
From: "Ian Lea"
To: java-user@lucene.apache.org
Subject: Re: Using lucene as a database... good idea or bad idea?
In-Reply-To: <6f7ea56f0807281853k1f898f5atc6beaf209e03f609@mail.gmail.com>

John

I think it's a great idea, and do exactly this to store 5 million+
documents with info that takes way too long to get out of our
Oracle database (think days). Not as many docs as you are talking
about, and less data for each doc, but I wouldn't have any concerns
about scaling. There are certainly Lucene indexes out there bigger
than what you propose.

You can compress the stored data to save some space. Run times for
optimization might get interesting, but see recent threads for
suggestions on that. And since you are not too concerned about
performance you may not need to optimize much, or even at all.

Of course you need to remember that this is not a DBMS solution in
the sense of transactions, recovery, etc., but I'm sure you are
already aware of that.

--
Ian.

On Tue, Jul 29, 2008 at 2:53 AM, John Evans wrote:
> Hi All,
>
> I have successfully used Lucene in the "traditional" way to provide
> full-text search for various websites.
> Now I am tasked with developing a
> data-store to back a web crawler. The crawler can be configured to retrieve
> arbitrary fields from arbitrary pages, so the result is that each document
> may have a random assortment of fields. It seems like Lucene may be a
> natural fit for this scenario since you can obviously add arbitrary fields
> to each document and you can store the actual data in the database. I've
> done some research to make sure that it would meet all of our individual
> requirements (that we can iterate over documents, update (delete/replace)
> documents, etc.) and everything looks good. I've also seen a couple of
> references around the net to other people trying similar things... however,
> I know it's not meant to be used this way, so I thought I would post here
> and ask for guidance. Has anyone done something similar? Is there any
> specific reason to think this is a bad idea?
>
> The one thing that I am least certain about is how well it will scale. We
> may reach the point where we have tens of millions of documents and a high
> percentage of those documents may be relatively large (10k-50k each). We
> actually would NOT be expecting/needing Lucene's normal extremely fast text
> search times for this, but we would need reasonable times for adding new
> documents to the index, retrieving documents by ID (for iterating over all
> documents), optimizing the index after a series of changes, etc.
>
> Any advice/input/theories anyone can contribute would be greatly
> appreciated.
>
> Thanks,
> -
> John