lucene-java-user mailing list archives

From Marcus Herou
Subject Scaling out/up or a mix
Date Fri, 26 Jun 2009 22:00:40 GMT

I currently have an index of 16GB per machine (8 machines = 128GB; the data
is stored externally, not in the index). It is growing like crazy (we are
indexing blogs, which are crazy by nature), and I have only allocated 2GB per
machine to the Lucene app, since we are running some other stuff there in

Each doc should be roughly the size of a blog post, no more than 20k.

We currently have about 90M documents and the number is increasing rapidly, so
reaching the 1G+ (billion-plus) document range is not far off.
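
A rough back-of-the-envelope projection of that growth can be sketched as below. It uses only the figures from this message (8 machines, 16GB of index each, about 90M documents) and assumes, without any measurement behind it, that index size scales linearly with document count and that the machine count stays at 8. The function name `project_index_size` is made up for illustration.

```python
# Back-of-the-envelope index growth projection.
# Inputs from the message: 8 machines, 16 GB of index each, ~90M documents.
# ASSUMPTION (not measured): index size grows linearly with document count
# and the number of machines stays the same.

GB = 1024 ** 3


def project_index_size(machines, gb_per_machine, current_docs, target_docs):
    """Return (bytes_per_doc, total_gb_at_target, gb_per_machine_at_target)."""
    total_bytes = machines * gb_per_machine * GB
    # Average index footprint of one document today.
    bytes_per_doc = total_bytes / current_docs
    # Linear extrapolation to the target document count.
    total_gb_at_target = bytes_per_doc * target_docs / GB
    return bytes_per_doc, total_gb_at_target, total_gb_at_target / machines


if __name__ == "__main__":
    per_doc, total_gb, per_machine_gb = project_index_size(
        machines=8,
        gb_per_machine=16,
        current_docs=90_000_000,
        target_docs=1_000_000_000,
    )
    print(f"index bytes per document today: {per_doc:.0f}")
    print(f"projected total index at 1G docs: {total_gb:.0f} GB")
    print(f"projected index per machine: {per_machine_gb:.0f} GB")
```

Under those assumptions the index costs roughly 1.5KB per document today, and 1G documents would mean roughly 1.4TB of index in total, or about 178GB per machine on the current 8 machines, far beyond the 2GB allocated to the Lucene app.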

Now due to search performance I think I need to move these instances to
dedicated index/search machines (or index on some machines and search on
others). Anyway I would like to get some feedback about two things:

1. What is the most important hardware aspect when it comes to adding documents
to the index and optimizing it?
1.1 Is it disk I/O write throughput? (sequential or random I/O?)
1.2 Is it RAM?
1.3 Is it CPU?

My guess would be disk I/O. Right or wrong?

2. What is the most important hardware aspect when it comes to searching
documents in my setup? (the result set is limited to the top 10 matches,
with paging)
2.1 Is it disk read throughput? (sequential or random I/O?)
2.2 Is it RAM?
2.3 Is it CPU?

I have no clue, since the data might not fit into memory. What is then the
most important factor? Read performance while scanning the index? CPU
while comparing fields and collecting results?

What I'm trying to find out is what I can do to get most bang for the buck
with a limited (aren't we all limited?) budget.



Marcus Herou CTO and co-founder Tailsweep AB
