lucene-general mailing list archives

From "James" <ja...@ryley.com>
Subject RE: Kind of hardware config ?
Date Tue, 29 Aug 2006 12:43:49 GMT
Hi,

He said that he has 100GB of data -- the number of documents is somewhat
unimportant.  100GB of data is going to end up being 30-60GB of index,
depending on certain things like whether you want to store both a stemmed
and unstemmed index (we do, to give the user the option of how they want to
search).  No way you are going to get that much data into memory on a normal
server -- half the memory will be used by the OS, JVM, etc.  I just
specified 4GB to give plenty for the general machine processes to work with,
but assumed that you will almost never be referring to actual data or
indexes in RAM.
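
As a rough illustration of the arithmetic above, here is a minimal sketch. The function names are hypothetical, and the 30-60% index ratio and the 50% memory overhead are simply the estimates from this message, not measurements:

```python
def index_size_range_gb(raw_gb, low_pct=30, high_pct=60):
    """Return the (low, high) index size estimate in GB for raw_gb of data."""
    return raw_gb * low_pct / 100, raw_gb * high_pct / 100


def fits_in_ram(index_gb, ram_gb, overhead_fraction=0.5):
    """True if the index fits in the RAM left after OS/JVM overhead (assumed)."""
    usable_gb = ram_gb * (1 - overhead_fraction)
    return index_gb <= usable_gb


if __name__ == "__main__":
    low, high = index_size_range_gb(100)
    print(f"estimated index size: {low:.0f}-{high:.0f} GB")
    print("smallest estimate fits in 4 GB RAM:", fits_in_ram(low, 4))
```

Even the low end of the estimate is far larger than the memory left on a 4GB machine, which is why the index is assumed to be read from disk.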

Sincerely,
James Ryley, Ph.D.

> -----Original Message-----
> From: Fredrik Andersson [mailto:fidde.andersson@gmail.com]
> Sent: Tuesday, August 29, 2006 4:29 AM
> To: general@lucene.apache.org
> Subject: Re: Kind of hardware config ?
> 
> Hey guys.
> 
> 4GB of RAM for an index of 2 million documents should really not be a
> problem. You should consider separating the index from the actual content
> (i.e., only save the index data in your index, not the HTML), if you have
> the possibility to do that. I am not very familiar with the core
> functionality in Lucene, but even if you stored the raw data with the
> index data, only the index data should be held in memory and the raw data
> read from disk with, if there's room, some caching.
> 
> With the numbers you mention, James, it sounds like both the raw data and
> index data are held in memory? If you have good insight into the
> internals, feel free to correct me on this issue... I'm also involved in
> applications with very large indices, so this is very interesting.
> 
> Thanks,
> Fredrik
> 
> 
> On 8/28/06, James <james@ryley.com> wrote:
> >
> > OK, so you aren't going to get it into memory unless you spend a lot on
> > servers.  We haven't found memory (or disk access) to be a limiting
> > factor anyway -- CPU is the issue.  I'm not sure what you want to spend,
> > but a single server with SATA RAID, 4GB RAM and the latest AMD processor
> > will search your collection in ~10-20 seconds, depending on the
> > complexity of the search.  If you need faster performance or the ability
> > to support many hits at once, you are going to have to parallelize the
> > configuration across multiple servers using ParallelMultiSearcher.
> >
> > Keep in mind that Lucene isn't really set up to handle parallel
> > searching robustly.  There is a lot of code you are going to have to
> > write for an enterprise-ready solution (e.g., checking the status of a
> > given server to make sure it isn't down, redundantly storing indexes so
> > that the search still functions if one server is down, potentially
> > handling laggards to increase speed, etc.).
> >
> > We have done some of this, and have more to do -- it is a very
> > non-trivial task.
> >
> > Sincerely,
> > James Ryley, Ph.D.
> >
> > > -----Original Message-----
> > > From: caribou_surf [mailto:eric@mixad.com]
> > > Sent: Monday, August 28, 2006 10:42 AM
> > > To: general@lucene.apache.org
> > > Subject: RE: Kind of hardware config ?
> > >
> > >
> > > > About 100 GB
> > >
> > >
> > >
> > > James-10 wrote:
> > > >
> > > > What's the total document size?
> > > >
> > > > Sincerely,
> > > > James Ryley, Ph.D.
> > > >
> > > >> -----Original Message-----
> > > >> From: caribou_surf [mailto:eric@mixad.com]
> > > >> Sent: Monday, August 28, 2006 5:01 AM
> > > >> To: general@lucene.apache.org
> > > >> Subject: Kind of hardware config ?
> > > >>
> > > >>
> > > >> We want to index about 2 million HTML documents with Lucene.
> > > >> Do you have an idea of the most suitable machine configuration
> > > >> (dual processor, 2 GB of memory, RAID disks...)?
> > > >> --
> > > >> View this message in context:
> > > >> http://www.nabble.com/Kind-of-hardware-config---tf2176085.html#a6016661
> > > >> Sent from the Lucene - General forum at Nabble.com.
> > > >
> > > >
> > > >
> > >
> > > --
> > > > View this message in context:
> > > > http://www.nabble.com/Kind-of-hardware-config---tf2176085.html#a6021457
> > > Sent from the Lucene - General forum at Nabble.com.
> >
> >
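
The extra work described in the quoted 8/28 message (checking that a server isn't down, and not waiting on laggards) can be sketched roughly as follows. This is not Lucene code: `search_shards` and the shard callables are hypothetical stand-ins for remote searchers, and the timeout value is an arbitrary assumption.

```python
import concurrent.futures as cf


def search_shards(shards, query, timeout_s=0.5, top_n=10):
    """Query every shard in parallel; skip shards that fail or are too slow.

    `shards` maps a shard name to a callable that takes a query and returns
    a list of (score, doc_id) hits (hypothetical interface).
    Returns (merged top hits, sorted names of skipped shards).
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(shards)))
    futures = {pool.submit(fn, query): name for name, fn in shards.items()}

    # Wait up to timeout_s; anything still running is treated as a laggard.
    done, not_done = cf.wait(futures, timeout=timeout_s)
    skipped = [futures[f] for f in not_done]

    hits = []
    for f in done:
        try:
            hits.extend(f.result())
        except Exception:
            # A shard that raised is treated as down for this query.
            skipped.append(futures[f])

    pool.shutdown(wait=False)  # do not block on laggards
    hits.sort(reverse=True)    # highest score first
    return hits[:top_n], sorted(skipped)
```

A real deployment would also need the redundant index copies mentioned in that message, so that a skipped shard's documents can still be served from another server; this sketch simply drops them from the result.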

