lucene-java-user mailing list archives

From "Alex McManus" <>
Subject Making a case for Lucene
Date Wed, 30 Jun 2004 12:25:28 GMT
We are at the initial design stages of a public-facing web-based search
application for a U.S. Federal Agency. We have proposed a clustered Lucene
architecture as the best technical solution, as we feel their current system
(based on Oracle) won't give the best performance, and introduces a lot of
unnecessary complexity and expense (as the system is read-only). We also
feel that the Lucene design will be very flexible, easier to maintain and

Government agencies are notoriously conservative when it comes to decisions
about technology, especially when open-source is involved. Perhaps
surprisingly, their response has been encouraging. However, they want
further re-assurance that other big-name organizations have successfully
used Lucene for large datasets.

First some background: we will be searching a number of repositories, the
largest of which includes about 600,000 documents, and might reach 10
million over the next 10 years. The documents are probably comparable to web
pages in terms of average size, and would be indexed under about 10
different fields. Our plan is to partition the indexes and distribute them
over a number of modern Intel/Linux servers.
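
To make the partitioning plan concrete, here is a minimal sketch in plain
Java (no Lucene calls). The class name, the hash-based assignment, and the
partition count of 4 are assumptions for illustration only and are not part
of the proposal; only the document counts come from the figures above.

```java
public class PartitionSketch {

    // Map a document key to one of numPartitions index partitions.
    // floorMod keeps the result non-negative even when hashCode() is negative.
    public static int partitionFor(String docKey, int numPartitions) {
        return Math.floorMod(docKey.hashCode(), numPartitions);
    }

    // Documents per partition if the corpus is spread evenly (rounded up).
    public static long docsPerPartition(long totalDocs, int numPartitions) {
        return (totalDocs + numPartitions - 1) / numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 4; // hypothetical number of servers

        // Largest repository today, and the projected size in 10 years.
        System.out.println("Today:  " + docsPerPartition(600_000L, partitions));
        System.out.println("Year 10: " + docsPerPartition(10_000_000L, partitions));

        // The same key always lands on the same partition.
        System.out.println("doc-42 -> partition " + partitionFor("doc-42", partitions));
    }
}
```

With four partitions, the largest repository would put roughly 150,000
documents on each server today and 2.5 million at the projected 10 million.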

From what I pick up on the mailing lists, this seems well within the
capabilities of Lucene. I've looked at the Powered-by Lucene pages, but
there are two problems: (i) there are no details on the size of the datasets
being searched; (ii) I don't think our customer would recognize any of these

In Otis' OnJava article, he lists "FedEx, Overture, Mayo Clinic, Hewlett
Packard, New Scientist magazine, Epiphany, and others using, or at least
evaluating, Lucene". This is more like it(!), but I want to be honest and
open with our customer, and the "or at least evaluating" comment is not
concrete enough, and there is no indication of scale.

The best example that I've been able to find is the Yahoo research lab - as
I understand it, this is a Nutch (i.e. Lucene) implementation that's
providing impressive performance over a 100 million document repository.

I would be very grateful if anyone could pass on some basic details of
successful large-scale Lucene projects, and even more so if they involve a
"big name" or government agency. If you are happy to pass this information
on, but would prefer to keep it off the public mailing list, then please
email me directly - I will respect confidentiality.

I think that this problem of re-assuring customers/managers is a common one,
so I would be happy to collate any responses to this as a new Wiki entry.
Hopefully one day (with their permission) we will be able to add our
customer to the Powered-by Lucene page too.

Thanks in advance,
Alex McManus
