+ The purpose of these user-submitted performance figures is to
+ give current and potential users of Lucene a sense
+ of how well Lucene scales. If the requirements for an upcoming
+ project are similar to an existing benchmark, you
+ will also have something to work with when designing the system
+ architecture for the application.
+
+ If you've conducted performance tests with Lucene, we'd
+ appreciate it if you would submit these figures for display
+ on this page. Post these figures to the lucene-user mailing list
+ using this
+ template.
+
+
+ Hardware Environment
+
+
+ Software environment
+
+ Lucene indexing variables
+
+ Figures
+
+ Notes
+
+ These benchmarks have been kindly submitted by Lucene users for
+ reference purposes.
+
+ We make NO guarantees regarding their accuracy or validity.
+
+ We strongly recommend you conduct your own performance benchmarks
+ before deciding on a particular hardware/software setup (and
+ hopefully submit these figures to us).
+
+ Hardware Environment
+
+ Software environment
+
+ Lucene indexing variables
+
+ Figures
+
+ Notes
+
+ A Windows client ran a random document generator which
+ created documents based on some arrays of values and an excerpt
+ (approx. 1 KB) from a text file of the Bible (King James Version).
+ These were submitted via a socket connection (open throughout
+ the indexing process).
+ The index writer was not closed between index calls.
+ This created a 400 MB index in 23 files (after
+ optimization).
+
+ Query details:
+
+ Set up a threaded class to start x number of simultaneous threads to
+ search the index created above.
+
+ Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) (Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y) +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
+
+ This query counted 34,000 documents and I limited the returned
+ documents to 5.
+
+ This uses Peter Halacsy's IndexSearcherCache, slightly modified to
+ be a singleton that returns cached searchers for a given directory.
+ This solved an initial problem with too many open files and
+ running out of Linux file handles for them.
+
+ Threads | Avg time per query (ms)
+       1 |  1009
+       2 |  2043
+       3 |  3087
+       4 |  4045
+      .. |  ..
+      10 | 10091
+
+ I removed the two date range terms from the query and it made a HUGE
+ difference in performance. With 4 threads the average time
+ dropped to 900 ms!
+
+ Other query optimizations made little difference.
+
+ Hamish can be contacted at hamish at catalyst.net.nz.
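The thread test described above can be approximated with a small harness like the following. This is only a sketch using the standard library: the class name QueryLoadSketch and the method averageMillis are hypothetical, and the "query" is any Runnable you supply (the main method uses a sleep as a stand-in, not a real Lucene search).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical harness: runs the same query on N threads at once and
// reports the mean wall-clock time per query.
class QueryLoadSketch {

    /** Runs {@code query} once on each of {@code threads} threads; returns the mean time per call in ms. */
    static double averageMillis(int threads, Runnable query) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> jobs = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                jobs.add(() -> {
                    long start = System.nanoTime();
                    query.run();
                    return System.nanoTime() - start;
                });
            }
            long totalNanos = 0;
            // invokeAll blocks until every job has finished.
            for (Future<Long> result : pool.invokeAll(jobs)) {
                totalNanos += result.get();
            }
            return totalNanos / 1_000_000.0 / threads;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real search call (assumption: replace with your own searcher code).
        Runnable fakeQuery = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int n = 1; n <= 4; n++) {
            System.out.printf("%d threads: %.0f ms per query%n", n, averageMillis(n, fakeQuery));
        }
    }
}
```

If the average time grows roughly in proportion to the thread count, as in the figures above, the queries are effectively being serialized on some shared resource.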
+
+ Hardware Environment
+
+ Software environment
+
+ Lucene indexing variables
+
+ Figures
+
+ Notes
+
+ We have 10 threads reading files from the filesystem, parsing and
+ analyzing them, and then pushing them onto a queue, and a single
+ thread popping them from the queue and indexing. Note that we are
+ indexing email messages and are storing the entire plaintext of the
+ message in the index. If the message contains an attachment and we do
+ not have a filter for the attachment (i.e., we do not do PDFs yet),
+ we discard the data.
+
+ Justin can be contacted at tvxh-lw4x at spamex.com.
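The arrangement described above (several parsing threads feeding one indexing thread through a queue) might look roughly like this. It is a sketch under stated assumptions, not Justin's code: the names IndexPipelineSketch and run are hypothetical, "parsing" is reduced to trimming and lowercasing, and "indexing" is reduced to appending to a list where a real setup would add a document to the index writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical producer/consumer pipeline: many parser threads, one indexer thread.
class IndexPipelineSketch {
    // Sentinel telling the indexer that no more documents will arrive (compared by identity).
    private static final String END_OF_INPUT = new String("<end-of-input>");

    static List<String> run(List<String> messages, int parserThreads) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        List<String> indexed = new ArrayList<>();

        // Single consumer: only this thread touches the "index".
        Thread indexer = new Thread(() -> {
            try {
                for (String doc = queue.take(); doc != END_OF_INPUT; doc = queue.take()) {
                    indexed.add(doc); // stand-in for adding the document to the index
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        // Producers: parse in parallel and push the results onto the queue.
        ExecutorService parsers = Executors.newFixedThreadPool(parserThreads);
        for (String raw : messages) {
            parsers.execute(() -> {
                try {
                    queue.put(raw.trim().toLowerCase(Locale.ROOT)); // stand-in for parse + analyze
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        parsers.shutdown();
        parsers.awaitTermination(1, TimeUnit.MINUTES);

        queue.put(END_OF_INPUT); // all producers are done, so let the indexer finish
        indexer.join();
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of(" First message ", "SECOND message"), 10));
    }
}
```

Keeping a single consumer means the indexing step needs no locking of its own; the bounded queue also stops the parsers from running far ahead of the indexer.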
+ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
+ nor was the total index built in one shot. The index was created on several different
+ machines (all with these specs, or very similar), with each machine indexing batches of
+ 500,000 to 1 million documents per batch. Each of these small indexes was then moved to a
+ much larger drive, where they were all merged together into a big index.
+ This process was done manually, over the course of several months, as the sources became available.
+
+ Hardware Environment
+
+ Software environment
+
+ Lucene indexing variables
+
+ Figures
+
+ Notes
+
+ The source documents were XML. The "indexer" opened each document one at a time, ran an
+ XSL transformation on it, and then proceeded to index the stream. The indexer optimized
+ the index every 50,000 documents (on this run), though previously we optimized every
+ 300,000 documents. The performance didn't change much either way. We did no other
+ tuning (RAM Directories, separate process to pretransform the source material, etc.)
+ to make it index faster. When all of these individual indexes were built, they were
+ merged together into the main index. That process usually took about a day.
+
+ Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+ I'm doing a technical evaluation of search engines
+ for Ariba, an enterprise application software company.
+ I compared Lucene to a commercial C language based
+ search engine which I'll refer to as vendor A.
+ Overall Lucene's performance was similar to vendor A
+ and met our application's requirements. I've
+ summarized our results below.
+
+ Search scalability:
+ We ran a set of 16 queries in a single thread for 20
+ iterations. We report below the times for the last 15
+ iterations (i.e., after the system was warmed up). The
+ 4 sets of results below are for indexes with between
+ 50,000 and 600,000 documents. Although the
+ times for Lucene grew faster with document count than
+ those for vendor A, they were comparable.
+
+ Documents | Lucene      | Vendor A
+ 50K       | 5.2 seconds | 7.2
+ 200K      | 15.3        | 15.2
+ 400K      | 28.2        | 25.5
+ 600K      | 41          | 33
+
+ Individual Query times:
+ Total query times are very similar between the 2
+ systems but there were larger differences when you
+ looked at individual queries.
+
+ For simple queries with small result sets Vendor A was
+ consistently faster than Lucene. For example a
+ single query might take vendor A 32 thousandths of a
+ second and Lucene 64 thousandths of a second. Both
+ times are however well within acceptable response
+ times for our application.
+
+ For simple queries with large result sets Vendor A was
+ consistently slower than Lucene. For example a
+ single query might take vendor A 300 thousandths of a
+ second and Lucene 200 thousandths of a second.
+ For more complex queries of the form (term1 or term2
+ or term3) AND (term4 or term5 or term6) AND (term7 or
+ term8) the results were more divergent. For
+ queries with small result sets Vendor A generally had
+ very short response times and sometimes Lucene had
+ significantly larger response times. For example
+ Vendor A might take 16 thousandths of a second and
+ Lucene might take 156. I do not consider it to be
+ the case that Lucene's response time grew unexpectedly
+ but rather that Vendor A appeared to be taking
+ advantage of an optimization which Lucene didn't have.
+ (I believe there have been discussions on the dev
+ mailing list on complex queries of this sort.)
+
+ Index Size:
+ For our test data the size of both indexes grew
+ linearly with the number of documents. Note that
+ these sizes are compact sizes, not maximum size during
+ index loading. The numbers below are from running du
+ -k in the directory containing the index data. The
+ larger numbers below for Vendor A may be because it
+ supports additional functionality not available in
+ Lucene. I think it's the constant rate of growth
+ rather than the absolute amount which is more
+ important.
+
+ Documents | Lucene  | Vendor A
+ 50K       | 45516 K | 63921
+ 200K      | 171565  | 228370
+ 400K      | 345717  | 457843
+ 600K      | 511338  | 684913
+
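One way to check the "constant rate of growth" claim against the sizes listed above is to divide each index size by its document count. The sketch below does that arithmetic; the class and method names are hypothetical and the only inputs are the du -k figures from this benchmark.

```java
// Hypothetical helper: index size per 1,000 documents, from the du -k figures above.
class IndexGrowthSketch {
    static final int[] DOCS = {50_000, 200_000, 400_000, 600_000};
    static final int[] LUCENE_KB = {45_516, 171_565, 345_717, 511_338};
    static final int[] VENDOR_A_KB = {63_921, 228_370, 457_843, 684_913};

    /** Index size in KB for every 1,000 documents indexed. */
    static double kbPerThousandDocs(int sizeKb, int docs) {
        return sizeKb * 1000.0 / docs;
    }

    public static void main(String[] args) {
        for (int i = 0; i < DOCS.length; i++) {
            System.out.printf("%d docs: Lucene %.1f, Vendor A %.1f (KB per 1,000 docs)%n",
                    DOCS[i],
                    kbPerThousandDocs(LUCENE_KB[i], DOCS[i]),
                    kbPerThousandDocs(VENDOR_A_KB[i], DOCS[i]));
        }
    }
}
```

For Lucene the ratio stays between about 852 and 910 KB per 1,000 documents, and for Vendor A between about 1,141 and 1,278, which is consistent with roughly linear growth for both.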
+ Indexing Times:
+ These times are for reading the documents from our
+ database, processing them, inserting them into the
+ document search product and index compacting. Our
+ data has a large number of fields/attributes. For
+ this test I restricted Lucene to 24 attributes to
+ reduce the number of files created. Doing this I was
+ able to specify a merge width for Lucene of 60. I
+ found in general that Lucene indexing performance is
+ very sensitive to changes in the merge width.
+ Note also that our application does a full compaction
+ after inserting every 20,000 documents. These times
+ are just within our acceptable limits but we are
+ interested in alternatives to increase Lucene's
+ performance in this area.
+
+
+ Documents | Lucene     | Vendor A
+ 600K      | 81 minutes | 34 minutes
+
+ (I don't have accurate results for all sizes on this
+ measure but believe that the indexing time for both
+ solutions grew essentially linearly with size. The
+ time to compact the index generally grew with index
+ size but it's a small percent of overall time at these
+ sizes.)
+
+ Hardware Environment
+
+ Software environment
+
+ Lucene indexing variables
+
+ Figures
+
+ Notes
+
+
This page lists external Lucene resources. If you have
+written something that should be included, please post all
+relevant information to one of the mailing lists. Nothing
+listed here is directly supported by the Lucene
+developers, so if you encounter any problems with any of
+this software, please use the author's contact information
+to get help.
+
+If you are looking for information on contributing patches or other improvements to Lucene, see
+How To Contribute on the Lucene Wiki.
+
+Software that works with Lucene indices.
+
+URL:    http://www.getopt.org/luke/
+author: Andrzej Bialecki
+
+URL:    http://limo.sf.net/
+author: Julien Nioche
+
+Lucene requires information you want to index to be
+converted into a Document class. Here are
+contributions for various solutions that convert different
+content types to Lucene's Document classes.
+
+URL:    http://marc.theaimsgroup.com/?l=lucene-dev&m=100723333506246&w=2
+author: Philip Ogren - ogren@mayo.edu
+
+URL:    http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00346.html
+author: Peter Carlson - carlson@bookandhammer.com
+
+URL:    http://www.pdfbox.org/
+author: Ben Litchfield - ben@csh.rit.edu
+
+URL:    http://www.foolabs.com/xpdf
+author: N/A
+
+URL:    http://snowtide.com
+author: N/A
+
+URL:    http://www.etymon.com/
+author: N/A
+
+URL:    http://savannah.nongnu.org/projects/aramorph
+author: Pierrick Brihaye
+
+URL:    http://www.companywebstore.de/tangentum/mirror/en/products/phonetix/index.html
+author: tangentum technologies
+
+URL:    http://ejindex.sourceforge.net/
+author: Andy Scholz
+
+URL:    https://javacc.dev.java.net/
+author: Sun Microsystems (java.net)
+
+This document is intended as a "getting started" guide to using and running the Lucene demos.
+It walks you through some basic installation and configuration.
+
+The Lucene command-line demo code consists of two applications that demonstrate various
+functionalities of Lucene and how one should go about adding Lucene to their applications.
+
+First, you should download the latest Lucene distribution and then extract it to a working
+directory. Alternatively, you can check out the sources from Subversion, and then run
+ant war-demo to generate the JARs and WARs.
+
+You should see the Lucene JAR file in the directory you created when you extracted the archive. It
+should be named something like lucene-core-{version}.jar. You should also see a file
+called lucene-demos-{version}.jar. If you checked out the sources from Subversion then
+the JARs are located under the build subdirectory (after running ant
+successfully). Put both of these files in your Java CLASSPATH.
+
+Once you've gotten this far you're probably itching to go. Let's build an index! Assuming
+you've set your CLASSPATH correctly, just type:
+
+    java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src
+
+This will produce a subdirectory called index which will contain an index of all of the
+Lucene source code.
+
+To search the index type:
+
+    java org.apache.lucene.demo.SearchFiles
+
+You'll be prompted for a query. Type in a swear word and press the enter key. You'll see that the
+Lucene developers are very well mannered and get no results. Now try entering the word "vector".
+That should return a whole bunch of documents. The results will page at every tenth result and ask
+you whether you want more results.
+
+In this section we walk through the sources behind the command-line Lucene demo: where to find them,
+their parts and their function. This section is intended for Java developers wishing to understand
+how to use Lucene in their applications.
+
+Relative to the directory created when you extracted Lucene or retrieved it from Subversion, you
+should see a directory called src which in turn contains a directory called
+demo. This is the root for all of the Lucene demos. Under this directory is
+org/apache/lucene/demo. This is where all the Java sources for the demos live.
+
+Within this directory you should see the IndexFiles.java class we executed earlier.
+Bring it up in vi or your editor of choice and let's take a look at it.
+
+As we discussed in the previous walk-through, the IndexFiles class creates a Lucene
+Index. Let's take a look at how it does this.
+
+The first substantial thing the main function does is instantiate IndexWriter. It passes the string
+"index" and a new instance of a class called StandardAnalyzer.
+The "index" string is the name of the filesystem directory where all index information
+should be stored. Because we're not passing a full path, this will be created as a subdirectory of
+the current working directory (if it does not already exist). On some platforms, it may be created
+in other directories (such as the user's home directory).
+
+The IndexWriter is the main
+class responsible for creating indices. To use it you must instantiate it with a path that it can
+write the index into. If this path does not exist it will first create it. Otherwise it will
+refresh the index at that path. You can also create an index using one of the subclasses of
+Directory. In any case, you must also pass an
+instance of org.apache.lucene.analysis.Analyzer.
+
+The particular Analyzer we are using, StandardAnalyzer, is
+little more than a standard Java Tokenizer, converting all strings to lowercase and filtering out
+useless words and characters from the index. By useless words and characters I mean common language
+words such as articles (a, an, the, etc.) and other strings that would be useless for searching
+(e.g. 's). It should be noted that there are different rules for every language, and you
+should use the proper analyzer for each. Lucene currently provides Analyzers for a number of
+different languages (see the *Analyzer.java sources under
+contrib/analyzers/src/java/org/apache/lucene/analysis).
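To make the kind of processing described above concrete, here is a toy illustration in plain Java. It is not StandardAnalyzer and does not use any Lucene classes: the class name, the method analyze and the three-word stop list are assumptions chosen only to mirror the description (lowercasing, dropping articles, stripping a trailing 's).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer: lowercase, split into words, drop stop words and possessives.
class StopWordAnalyzerSketch {
    // Assumed stop list, taken from the articles named in the text.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a letter, digit or apostrophe.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            // Strip a trailing possessive 's, then any remaining apostrophes.
            String token = raw.endsWith("'s") ? raw.substring(0, raw.length() - 2) : raw;
            token = token.replace("'", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick fox's den is an Example"));
    }
}
```

The same processing has to be applied to query text as to indexed text, otherwise a search for "Fox" would never match the stored token "fox".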
+
+Looking further down in the file, you should see the indexDocs() code. This recursive
+function simply crawls the directories and uses FileDocument to create Document
+objects. The Document is simply a data object to
+represent the content in the file as well as its creation time and location. These instances are
+added to the indexWriter. Take a look inside FileDocument. It's not particularly
+complicated. It just adds fields to the Document.
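The recursive crawl described above can be sketched without any Lucene classes. In this illustration each "document" is just a map of field names to values; the class name CrawlSketch, the method collect and the field names "path" and "modified" are assumptions for the example (the timestamp used here is the file's last-modified time), not a copy of the demo source.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a recursive directory crawl that builds one simple "document" per file.
class CrawlSketch {

    static void collect(File file, List<Map<String, String>> docs) {
        if (!file.canRead()) {
            return; // skip anything we cannot read
        }
        if (file.isDirectory()) {
            String[] names = file.list();
            if (names != null) { // list() returns null on an I/O error
                Arrays.sort(names); // deterministic order
                for (String name : names) {
                    collect(new File(file, name), docs);
                }
            }
        } else {
            // A plain data object: where the file lives and when it last changed.
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("path", file.getPath());
            doc.put("modified", Long.toString(file.lastModified()));
            docs.add(doc); // stand-in for adding the document to the index writer
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        collect(new File(args.length > 0 ? args[0] : "."), docs);
        System.out.println(docs.size() + " files found");
    }
}
```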
+
+As you can see there isn't much to creating an index. The devil is in the details. You may also
+wish to examine the other samples in this directory, particularly the IndexHTML class. It is a
+bit more complex but builds upon this example.
+
+The SearchFiles class is
+quite simple. It primarily collaborates with an IndexSearcher, StandardAnalyzer
+(which is used in the IndexFiles class as well) and a QueryParser. The
+query parser is constructed with an analyzer used to interpret your query text in the same way the
+documents are interpreted: finding the end of words and removing useless words like 'a', 'an' and
+'the'. The Query object contains
+the results from the QueryParser which is passed to
+the searcher. Note that it's also possible to programmatically construct a rich Query
+object without using the query
+parser. The query parser just enables decoding the Lucene query
+syntax into the corresponding Query object. The searcher results are
+returned in a collection of Documents called Hits which is then iterated through and
+displayed to the user.
+
+This document is intended as a "getting started" guide to installing and running the Lucene
+web application demo. This guide assumes that you have read the information in the previous two
+examples. We'll use Tomcat as our reference web container. These demos should work with nearly any
+container, but you may have to adapt them appropriately.
+
+The Lucene Web Application demo is a template web application intended for deployment on Tomcat or a
+similar web container. It's NOT designed as a "best practices" implementation by ANY means. It's
+more of a "hello world" type Lucene Web App. The purpose of this application is to demonstrate
+Lucene. With that being said, it should be relatively simple to create a small searchable website
+in Tomcat or a similar application server.
+
+Once you've gotten this far you're probably itching to go. Let's start by creating the index
+you'll need for the web examples. Since you've already set your CLASSPATH in the previous examples,
+all you need to do is type:
+
+    java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..
+
+You'll need to do this from a (any) subdirectory of your {tomcat}/webapps directory
+(make sure you didn't leave off the .. or you'll get a null pointer exception).
+{index-dir} should be a directory that Tomcat has permission to read and write, but is
+outside of a web accessible context. By default the webapp is configured to look in
+/opt/lucene/index for this index.
+
+Located in your distribution directory you should see a war file called
+luceneweb.war. If you're working with a Subversion checkout, this will be under the
+build subdirectory. Copy this to your {tomcat-home}/webapps directory.
+You may need to restart Tomcat.
+
From your Tomcat directory look in the webapps/luceneweb subdirectory. If it's not
+present, try browsing to http://localhost:8080/luceneweb (which causes Tomcat to deploy
+the webapp), then look again. Edit a file called configuration.jsp. Ensure that the
+indexLocation is equal to the location you used for your index. You may also customize
+the appTitle and appFooter strings as you see fit. Once you have finished
+altering the configuration you may need to restart Tomcat. You may also wish to update the war file
+by typing jar -uf luceneweb.war configuration.jsp from the luceneweb
+subdirectory. (The -u option is not available in all versions of jar. In this case recreate the
+war file.)
+
Now you're ready to roll. In your browser set the URL to
+http://localhost:8080/luceneweb, enter test and the number of items per
+page, and press search.
+
You should now be looking either at a number of results (provided you didn't erase the Tomcat
+examples) or nothing. If you get an error regarding opening the index, then you probably set the
+path in configuration.jsp incorrectly or Tomcat doesn't have permissions to the index
+(or you skipped the step of creating it). Try other search terms. Depending on the number of items
+per page you set and results returned, there may be a link at the bottom that says More
+Results>>; clicking it takes you to subsequent pages.
+
+If you want to know more about how this web app works or how to customize it, then read on.
+