lucene-general mailing list archives

From "James" <ja...@ryley.com>
Subject RE: Infrastructure for large Lucene index
Date Fri, 06 Oct 2006 20:37:54 GMT
We have both.  Although we have multiple collections, our largest
collections are still far too big for one machine.  You have to come up with
a scheme to split documents randomly across multiple servers (randomly so
that skewed word frequencies don't leave one index getting pounded while
the others sit idle for certain searches).
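
A minimal sketch of that routing scheme (the class and document ids here are hypothetical, not FreePatentsOnline's actual code): hashing the document id gives a deterministic, roughly uniform spread, so documents containing "hot" terms land on different shards instead of piling onto one index.

```java
// Hypothetical sketch of random/hash-based shard assignment.
// Deterministic hashing (rather than a coin flip) means the same
// document always routes to the same shard, which makes batch
// updates repeatable.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Math.floorMod handles negative hashCode values, so the result
    // is always a valid shard index in [0, numShards).
    public int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), numShards);
    }
}
```

Each shard is then indexed and searched independently, with results merged at query time.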

Sincerely,
James

> -----Original Message-----
> From: Slava Imeshev [mailto:imeshev@yahoo.com]
> Sent: Friday, October 06, 2006 4:33 PM
> To: general@lucene.apache.org
> Subject: RE: Infrastructure for large Lucene index
> 
> James,
> 
> --- James <james@ryley.com> wrote:
> > We currently do this across many machines for
> > http://www.FreePatentsOnline.com.  Our indexes are, in aggregate
> > across our various collections, even larger than you need.  We use
> > Remote ParallelMultiSearcher, with some custom modifications (and we
> > are in the process of making more) to allow more robust handling of
> > many processes at
> 
> I am not sure ParallelMultiSearcher is going to help here, because we
> have a large uniform index, not a set of collections.
> 
> > once and integration of the responses from various sub-indexes.  This
> > works fine on commodity hardware, and you will be I/O bound, so get
> > multiple drives in each machine.
> >
> > Out of curiosity, what project are you working on?  That's a lot of
> > hits!
> >
> > Sincerely,
> > James Ryley, Ph.D.
> > www.FreePatentsOnline.com
> >
> >
> > > -----Original Message-----
> > > From: Slava Imeshev [mailto:imeshev@yahoo.com]
> > > Sent: Friday, October 06, 2006 2:28 PM
> > > To: general@lucene.apache.org
> > > Subject: Infrastructure for large Lucene index
> > >
> > >
> > > I am dealing with a pretty challenging task, so I thought it would be
> > > a good idea to ask the community before I re-invent any wheels of my
> > > own.
> > >
> > > I have a Lucene index that is going to grow to 100GB soon. This
> > > index is going to be read very aggressively (tens of millions of
> > > requests per day) with some occasional updates (10 batches per day).
> > >
> > > The idea is to split load between multiple server nodes running
> > > Lucene on *nix while accessing the same index that is shared across
> > > the network.
> > >
> > > I am wondering if it's a good idea and/or if there are any
> > > recommendations regarding selecting/tweaking the network
> > > configuration (software+hardware) for an index of this size.
> 
> Regards,
> 
> Slava Imeshev
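
The "integration of the responses from various sub-indexes" mentioned above can be sketched roughly as follows (a simplified stand-in, not the poster's modified ParallelMultiSearcher code: class names, the `Hit` record, and the score-only ranking are all assumptions). Each shard returns its own top hits with scores, and a bounded min-heap keeps only the k best overall:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: merge top-k scored hits from several shard
// searchers into one globally ranked result list.
public class HitMerger {
    // A shard's hit: document id plus its relevance score.
    public record Hit(String docId, float score) {}

    public static List<Hit> mergeTopK(List<List<Hit>> perShardHits, int k) {
        // Min-heap ordered by score: the root is the weakest hit kept so far.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
        for (List<Hit> shard : perShardHits) {
            for (Hit h : shard) {
                heap.offer(h);
                if (heap.size() > k) {
                    heap.poll(); // evict the current lowest-scoring hit
                }
            }
        }
        // Drain the heap and sort best-first for presentation.
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return merged;
    }
}
```

Note this assumes scores from different shards are comparable; in practice, sharding documents randomly (as suggested above) keeps per-shard term statistics similar enough that naive score merging is workable.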

