lucene-general mailing list archives

From Slava Imeshev <imes...@yahoo.com>
Subject RE: Infrastructure for large Lucene index
Date Fri, 06 Oct 2006 21:09:40 GMT
--- James <james@ryley.com> wrote:
> We have both.  Although we have multiple collections, our largest
> collections are still way too big for one machine.  You have to come up with
> a scheme to randomly split documents across multiple servers (randomly so
> that word frequency issues hopefully don't mean one index is getting pounded
> while others are inactive for certain searches).

Yes, this is a valid concern.
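
A minimal sketch of the random split described above, in plain Python with no Lucene dependency. It is an illustration under stated assumptions, not how FreePatentsOnline implements it: the names shard_for and merge_top_hits are hypothetical, the stable MD5 hash stands in for "random" so that a given document always lands on the same server, and each shard is assumed to return (score, doc_id) pairs.

```python
import hashlib
import heapq
from collections import Counter


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number in [0, num_shards).

    A stable hash spreads documents pseudo-randomly, so no single shard
    collects all documents about one topic, and the mapping is repeatable.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def merge_top_hits(per_shard_hits, n):
    """Merge per-shard lists of (score, doc_id) into the global top n.

    Assumption: scores from different shards are comparable, which is
    only approximately true when term statistics are similar per shard.
    """
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(n, all_hits)


if __name__ == "__main__":
    # Check how evenly hypothetical document IDs spread over 4 shards.
    counts = Counter(shard_for(f"doc-{i}", 4) for i in range(100_000))
    print(sorted(counts.items()))

    # Combine the hits returned by two hypothetical sub-indexes.
    shard_a = [(0.91, "doc-17"), (0.40, "doc-3")]
    shard_b = [(0.75, "doc-88"), (0.10, "doc-5")]
    print(merge_top_hits([shard_a, shard_b], 3))
```

Every query is sent to all shards and the partial results are merged, so an even spread of documents keeps the per-query work roughly equal on each server.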

Slava


> 
> Sincerely,
> James
> 
> > -----Original Message-----
> > From: Slava Imeshev [mailto:imeshev@yahoo.com]
> > Sent: Friday, October 06, 2006 4:33 PM
> > To: general@lucene.apache.org
> > Subject: RE: Infrastructure for large Lucene index
> > 
> > James,
> > 
> > --- James <james@ryley.com> wrote:
> > > We currently do this across many machines for
> > > http://www.FreePatentsOnline.com.  Our indexes are, in aggregate across our
> > > various collections, even larger than you need.  We use Remote
> > > ParallelMultiSearcher, with some custom modifications (and we are in the
> > > process of making more) to allow more robust handling of many processes at
> > 
> > I am not sure if ParallelMultiSearcher is going to help here because we have
> > a large uniform index, not a set of collections.
> > 
> > > once and integration of the responses from various sub-indexes.  This works
> > > fine on commodity hardware, and you will be IO bound, so get multiple drives
> > > in each machine.
> > >
> > > Out of curiosity, what project are you working on?  That's a lot of hits!
> > >
> > > Sincerely,
> > > James Ryley, Ph.D.
> > > www.FreePatentsOnline.com
> > >
> > >
> > > > -----Original Message-----
> > > > From: Slava Imeshev [mailto:imeshev@yahoo.com]
> > > > Sent: Friday, October 06, 2006 2:28 PM
> > > > To: general@lucene.apache.org
> > > > Subject: Infrastructure for large Lucene index
> > > >
> > > >
> > > > > I am dealing with a pretty challenging task, so I thought it would be
> > > > > a good idea to ask the community before I re-invent any wheels of my own.
> > > > >
> > > > > I have a Lucene index that is going to grow to 100GB soon. This
> > > > > index is going to be read very aggressively (tens of millions of requests
> > > > > per day) with some occasional updates (10 batches per day).
> > > >
> > > > > The idea is to split load between multiple server nodes running Lucene
> > > > > on *nix while accessing the same index that is shared across the network.
> > > >
> > > > > I am wondering if it's a good idea and/or if there are any recommendations
> > > > > regarding selecting/tweaking network configuration (software+hardware)
> > > > > for an index of this size.
> > 
> > Regards,
> > 
> > Slava Imeshev
> 
> 
> 

