lucene-java-user mailing list archives

From "Kovnatsky, Eugene" <>
Subject RE: Lucene Software/Hardware Setup Question
Date Tue, 26 Oct 2010 21:17:37 GMT
Thanks Toke. Very descriptive. A few more questions about your SSD:
 - what is its current size?
 - do you project any growth in your index size?
 - if yes, how do you plan to correlate that with your hardware?

Thanks again


-----Original Message-----
From: Toke Eskildsen [] 
Sent: Tuesday, October 26, 2010 2:26 AM
Subject: Re: Lucene Software/Hardware Setup Question

On Tue, 2010-10-26 at 02:16 +0200, Kovnatsky, Eugene wrote:
> I am trying to get some information on what enterprise hardware folks
> use out there. We are using Lucene extensively. Our total catalog
> size is roughly 50GB across some 8 catalogs, 2 of which
> take up 60-70% of this size.

That sounds a lot like our setup at the State and University Library,
Denmark. We have about 9M records with an index size of 59GB, with 4.5M
OAI-PMH harvested records and 2.5M bibliographic records from our
Aleph system. The rest of the records are divided among 16 different

> So my question is - if any of you guys have similar catalog sizes then
> what kind of software/hardware do you have running, i.e. what app
> servers, how many, what hardware are these app servers running on?

We use a home-brewed setup called Summa (open source) to handle the
workflow and the searching. It uses plain Lucene with a few custom
analyzers and some sorting, faceting, suggest and DidYouMean code.
One index holds all the material. Currently the index is updated on one
server and synced to two search machines, but we're in the middle of
moving the index updating onto the search machines themselves to get
faster updates.
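As a rough sketch of what the "plain Lucene" searching amounts to
(illustration only: Lucene 3.0-era API and made-up field names, not
Summa's actual code):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SortedSearchSketch {
    public static void main(String[] args) throws Exception {
        // Open the on-disk index (the part that lives on the SSDs).
        IndexSearcher searcher =
            new IndexSearcher(FSDirectory.open(new File("/path/to/index")));

        // Our custom analyzers would plug in here; StandardAnalyzer stands in.
        QueryParser parser = new QueryParser(Version.LUCENE_30, "freetext",
                new StandardAnalyzer(Version.LUCENE_30));
        Query query = parser.parse("hamlet");

        // Sort by an untokenized string field instead of by relevance score.
        Sort sort = new Sort(new SortField("sortTitle", SortField.STRING));
        TopDocs hits = searcher.search(query, null, 20, sort);

        for (ScoreDoc sd : hits.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            System.out.println(doc.get("recordId"));
        }
        searcher.close();
    }
}

The faceting, suggest and DidYouMean parts are custom code layered on
top of this basic searching.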

The hardware is 2 mirrored servers for failover. They are running some
Linux variant and have 2.5GHz quad-core Xeon CPUs with 6MB of level 2
cache and 16GB of RAM. We are not using virtualization for this. The
machines use traditional hard disks for data storage and fairly old
enterprise-class SSDs for the index. To be honest, they are currently
overkill - without faceting the throughput is 50-100 searches/second,
including the overhead of the web-service calls. Faceting slows this
down somewhat, but as our traffic is something like 5-10 searches/second
at prime time (guesstimating a lot here, as it has been a year or two
since I looked at the statistics), most of the time is spent idle.
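The 50-100 searches/second above includes the web-service layer; a crude
way to measure just the raw Lucene side of it (again illustration only,
with made-up queries and field names, not our actual test code) is to
hammer the IndexSearcher in a loop:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ThroughputSketch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher =
            new IndexSearcher(FSDirectory.open(new File("/path/to/index")));
        QueryParser parser = new QueryParser(Version.LUCENE_30, "freetext",
                new StandardAnalyzer(Version.LUCENE_30));

        // A handful of sample queries, cycled round-robin.
        String[] queries = {"hamlet", "kierkegaard", "grundtvig", "physics"};
        int iterations = 1000;

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // Top-20 hits, roughly one result page per search.
            searcher.search(parser.parse(queries[i % queries.length]), 20);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.1f searches/second%n", iterations / seconds);
        searcher.close();
    }
}

Going through the web-service layer and adding concurrent clients would
of course change the numbers, but it gives a ballpark figure for the
searcher itself.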

Before that we used dual-core Xeons, again with 16GB of RAM and SSDs.
They also performed just fine with our workload and were only replaced
due to a general reorganization of the servers. Before that, we used
some older 3.1GHz single-core Xeon machines with only 1MB of level 2
cache, 32GB of slow RAM and traditional hard disks. My old 1.8GHz
single-core laptop was about as fast for indexing & searching, and those
machines stand as testament that a lot of RAM and GHz does not help much
when the memory system is lacking.

We did a lot of testing some time ago and found that our searches were
mostly CPU-bound when using SSDs. We've talked with our hardware guys
about building new servers in anticipation of more data, and the current
vision is relatively modest machines with a quad-core i7, 16GB of RAM and
consumer-grade SSDs (Intel or SandForce). As we have mirrored servers
and since no one dies if they can't find a book at our library, using
enterprise SSDs is just a waste of money.

Toke Eskildsen
