Date: Mon, 22 Nov 2010 07:54:25 -0500
Subject: Re: best practice: 1.4 billions documents
From: Erick Erickson
To: java-user@lucene.apache.org

Are you looking at Solr? It already has a lot of the infrastructure you'd
otherwise be building yourself on top of Lucene, including replication,
distributed searching, etc. Yes, there's a learning curve for something new,
but your Lucene experience will help you a LOT with that.

It has support for sharding (which is what you'll certainly have to do to
handle your billion+ documents). Don't re-invent the wheel!!

In conjunction, see SolrJ, which gives you a Java interface to Solr and may
come in handy.

Start here: http://wiki.apache.org/solr/

Best
Erick

On Mon, Nov 22, 2010 at 1:46 AM, Luca Rondanini wrote:

> Hi David, thanks for your answer. it really helped a lot! so, you have an
> index with more than 2 billion segments. this is pretty much the answer I
> was searching for: lucene alone is able to manage such a big index.
>
> which kind of problems do you have with the parallel searchers? I'm going to
> build my index in the next couple of weeks; if you want, we can compare our
> data
>
> thanks again
> Luca
>
>
> On Sun, Nov 21, 2010 at 6:22 PM, David Fertig wrote:
>
> > Actually I've been bitten by a still-unresolved issue with the parallel
> > searchers and recommend a MultiReader instead.
> > We have a couple billion docs in our archives as well. Breaking them up by
> > day worked well for us, but you'll need to do something.
> >
> > -----Original Message-----
> > From: Luca Rondanini [mailto:luca.rondanini@gmail.com]
> > Sent: Sunday, November 21, 2010 8:13 PM
> > To: java-user@lucene.apache.org; yonik@lucidimagination.com
> > Subject: Re: best practice: 1.4 billions documents
> >
> > thank you both!
> >
> > Johannes, katta seems interesting but I will need to solve the problem of
> > "hot" updates to the index
> >
> > Yonik, I see your point - so your suggestion would be to build an
> > architecture based on ParallelMultiSearcher?
> >
> >
> > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley <yonik@lucidimagination.com>
> > wrote:
> >
> > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini wrote:
> > > > Hi everybody,
> > > >
> > > > I really need some good advice! I need to index in lucene something
> > > > like 1.4 billion documents. I have experience with lucene but I've
> > > > never worked with such a big number of documents. Also this is just
> > > > the number of docs at "start-up": they are going to grow, and fast.
> > > >
> > > > I don't have to tell you that I need the system to be fast and to
> > > > support real time updates to the documents
> > > >
> > > > The first solution that came to my mind was to use
> > > > ParallelMultiSearcher, splitting the index into many "sub-indexes"
> > > > (how many docs per index? 100,000?) but I don't have experience with
> > > > it and I don't know how well it will scale while the number of
> > > > documents grows!
> > > >
> > > > A more solid solution seems to be to build some kind of integration
> > > > with hadoop. But I didn't find much about lucene and hadoop
> > > > integration.
> > > >
> > > > Any idea? Which direction should I go (pure lucene or hadoop)?
> > >
> > > There seems to be a common misconception about hadoop regarding search.
> > > Map-reduce as implemented in hadoop is really for batch oriented jobs
> > > only (or those types of jobs where you don't need a quick response
> > > time). It's definitely not for normal queries (unless you have
> > > unusual requirements).
> > >
> > > -Yonik
> > > http://www.lucidimagination.com
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
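
For anyone following the thread, here is a minimal sketch of the two sizing ideas discussed above: how many sub-indexes a fixed docs-per-index figure implies, and how a per-day split (as David describes) narrows the set of indexes a date-range query has to open. Everything in it is an assumption made for illustration: DailyShardRouter, shardCount, shardFor, shardsFor and the "index-YYYY-MM-DD" naming scheme are hypothetical and not part of Lucene or Solr, and actually opening the selected indexes (for example through a MultiReader) is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene or Solr API.
class DailyShardRouter {

    // Number of sub-indexes needed for a given docs-per-index limit (rounded up).
    static long shardCount(long totalDocs, long docsPerShard) {
        return (totalDocs + docsPerShard - 1) / docsPerShard;
    }

    // Assumed naming scheme: one index per calendar day, e.g. "index-2010-11-22".
    static String shardFor(LocalDate day) {
        return "index-" + day; // LocalDate.toString() yields ISO-8601 (yyyy-MM-dd)
    }

    // Names of the daily indexes a query over [from, to] (inclusive) must open.
    static List<String> shardsFor(LocalDate from, LocalDate to) {
        List<String> names = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            names.add(shardFor(d));
        }
        return names;
    }

    public static void main(String[] args) {
        // 1.4 billion docs at 100,000 docs per sub-index
        System.out.println(shardCount(1_400_000_000L, 100_000L)); // prints 14000

        // A two-day query only touches two daily indexes
        System.out.println(shardsFor(LocalDate.of(2010, 11, 21),
                                     LocalDate.of(2010, 11, 22)));
        // prints [index-2010-11-21, index-2010-11-22]
    }
}
```

The first number shows why 100,000 docs per sub-index is probably too fine-grained for 1.4 billion documents: it implies 14,000 separate indexes. The right granularity depends on hardware and query patterns, which this sketch does not try to model.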