lucene-general mailing list archives

From Ken Krugler <>
Subject Re: Infrastructure for large Lucene index
Date Fri, 06 Oct 2006 23:09:57 GMT
>I am dealing with a pretty challenging task, so I thought it would be
>a good idea to ask the community before I re-invent any wheels of my own.
>I have a Lucene index that is going to grow to 100GB soon. This index
>is going to be read very aggressively (tens of millions of requests
>per day) with some occasional updates (10 batches per day).
>The idea is to split the load between multiple server nodes running Lucene
>on *nix while accessing the same index that is shared across the network.
>I am wondering if it's a good idea and/or if there are any recommendations
>regarding selecting/tweaking the network configuration (software+hardware)
>for an index of this size.

A few quick comments on this, including some points from the subsequent thread:

1. Unless you have a lot of RAM, sufficient to effectively keep the 
entire index in memory, you're better off maximizing the number of 
spindles. Using one big file server is, IMHO, a Really Bad Idea for 
this type of application. You'll pay top dollar for something with 
the reliability and performance that you think you need, and then 
you'll still wind up being I/O bound.

I'd say the best configuration is a dual-CPU/dual-core box with four 
fast drives and a boat-load of RAM - say 8GB for starters. Run four 
JVMs per box, one per index, with each index on its own drive. Assume 
the file system will do a reasonable job of caching data for you, so 
don't bother trying to use RAMDirectory or similar in-memory tricks.
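For what it's worth, each of those per-drive searchers is roughly the 
following - just a minimal sketch against the Lucene 2.x-era API, with 
made-up paths and field names; the point is only that you open the 
on-disk index with FSDirectory and let the OS cache it, rather than 
pulling it into a RAMDirectory:

// Minimal sketch (Lucene 2.x-era API; paths and field names are made up):
// open the on-disk index directly with FSDirectory and let the OS cache it.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;

public class DriveShardSearcher {
    public static void main(String[] args) throws Exception {
        // One JVM like this per index, one index per physical drive.
        IndexSearcher searcher =
            new IndexSearcher(FSDirectory.getDirectory("/drive1/index", false));

        Query query = new QueryParser("contents", new StandardAnalyzer())
            .parse("+lucene +index");

        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching docs");

        searcher.close();
    }
}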

2. It's easy to get hung up on document frequency skews. As James and 
others have noted, in general things seem to work OK by just 
randomizing which document goes to what index - e.g. do it by hash of 
the document URL/name, and make sure that every new batch of 
documents (if you're doing incremental updates) gets spread this same 
way. As long as your hash function has nothing to do with searchable 
terms that you care about, you should be OK.
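To make that concrete, here's a minimal sketch of that kind of routing 
- the shard count and keying on the document URL are just placeholder 
assumptions:

// Minimal sketch of hash-based shard assignment (shard count is made up).
// The key (document URL/name) has nothing to do with searchable terms,
// so term statistics end up spread roughly evenly across the shards.
public class ShardRouter {
    private static final int NUM_SHARDS = 16;   // e.g. 4 boxes x 4 indexes

    public static int shardFor(String docUrl) {
        // Mask the sign bit rather than using Math.abs(), which breaks
        // on Integer.MIN_VALUE.
        return (docUrl.hashCode() & 0x7fffffff) % NUM_SHARDS;
    }
}

Since the same URL always hashes to the same shard, each incremental 
batch lands on the shard that already holds any earlier version of 
that document, so deletes and updates stay local to one index.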

3. If you're worried about high availability, then one fairly simple 
approach is to have two parallel search clusters, with a load 
balancer in front. For each cluster, monitor both the front-end 
server (where the results get combined) and each of the back-end 
search servers - for example, with something like Big Brother or 
Ganglia. 
Then if one of the search servers (or, god forbid, the front end 
server) goes down, you can automatically remove that cluster from the 
load balancer's active set.
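The check itself doesn't have to be fancy. Something along these lines 
(hypothetical host names and ports) is the kind of probe a Big 
Brother/Ganglia-style monitor can run per cluster, with a non-zero exit 
telling your tooling to pull that cluster out of the load balancer's 
rotation:

// Minimal sketch of a per-cluster health probe (host names/ports made up):
// the cluster counts as healthy only if the front-end and every back-end
// search server accepts a TCP connection.
import java.net.InetSocketAddress;
import java.net.Socket;

public class ClusterHealthCheck {
    static boolean reachable(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000); // 2s timeout
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] nodes = { "frontend-a:8080", "search-a1:9000", "search-a2:9000" };
        boolean healthy = true;
        for (String node : nodes) {
            String[] hp = node.split(":");
            if (!reachable(hp[0], Integer.parseInt(hp[1]))) {
                System.out.println("DOWN: " + node);
                healthy = false;
            }
        }
        // Non-zero exit -> remove this cluster from the LB's active set.
        System.exit(healthy ? 0 : 1);
    }
}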

-- Ken
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
