Return-Path: Delivered-To: apmail-lucene-java-commits-archive@www.apache.org Received: (qmail 17142 invoked from network); 27 Nov 2006 00:01:32 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 27 Nov 2006 00:01:32 -0000 Received: (qmail 14832 invoked by uid 500); 27 Nov 2006 00:01:42 -0000 Delivered-To: apmail-lucene-java-commits-archive@lucene.apache.org Received: (qmail 14753 invoked by uid 500); 27 Nov 2006 00:01:41 -0000 Mailing-List: contact java-commits-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-commits@lucene.apache.org Received: (qmail 14742 invoked by uid 99); 27 Nov 2006 00:01:41 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 26 Nov 2006 16:01:41 -0800 X-ASF-Spam-Status: No, hits=-9.4 required=10.0 tests=ALL_TRUSTED,NO_REAL_NAME X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO eris.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 26 Nov 2006 16:01:27 -0800 Received: by eris.apache.org (Postfix, from userid 65534) id 51A3D1A9846; Sun, 26 Nov 2006 16:00:50 -0800 (PST) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: svn commit: r479465 [1/4] - in /lucene/java/trunk: docs/ docs/images/ docs/lucene-sandbox/ docs/styles/ src/site/ src/site/src/ src/site/src/documentation/ src/site/src/documentation/classes/ src/site/src/documentation/conf/ src/site/src/documentation/... 
Date: Mon, 27 Nov 2006 00:00:49 -0000 To: java-commits@lucene.apache.org From: gsingers@apache.org X-Mailer: svnmailer-1.1.0 Message-Id: <20061127000050.51A3D1A9846@eris.apache.org> X-Virus-Checked: Checked by ClamAV on apache.org Author: gsingers Date: Sun Nov 26 16:00:46 2006 New Revision: 479465 URL: http://svn.apache.org/viewvc?view=rev&rev=479465 Log: Updated the website to new Forrest based site, see Issue 707, part one of commits Added: lucene/java/trunk/src/site/ (with props) lucene/java/trunk/src/site/forrest.properties (with props) lucene/java/trunk/src/site/src/ lucene/java/trunk/src/site/src/documentation/ lucene/java/trunk/src/site/src/documentation/classes/ lucene/java/trunk/src/site/src/documentation/classes/CatalogManager.properties (with props) lucene/java/trunk/src/site/src/documentation/conf/ lucene/java/trunk/src/site/src/documentation/conf/cli.xconf lucene/java/trunk/src/site/src/documentation/content/ lucene/java/trunk/src/site/src/documentation/content/.htaccess lucene/java/trunk/src/site/src/documentation/content/xdocs/ lucene/java/trunk/src/site/src/documentation/content/xdocs/benchmarks.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/contributions.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/demo.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/demo2.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/demo3.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/images/ lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico (with props) 
lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/index.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/lucene-sandbox/ lucene/java/trunk/src/site/src/documentation/content/xdocs/lucene-sandbox/index.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/mailinglists.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/queryparsersyntax.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/releases.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/resources.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/scoring.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/site.xml (with props) 
lucene/java/trunk/src/site/src/documentation/content/xdocs/systemproperties.xml lucene/java/trunk/src/site/src/documentation/content/xdocs/tabs.xml (with props) lucene/java/trunk/src/site/src/documentation/content/xdocs/whoweare.xml lucene/java/trunk/src/site/src/documentation/sitemap.xmap (with props) lucene/java/trunk/src/site/src/documentation/skinconf.xml (with props) Removed: lucene/java/trunk/docs/benchmarks.html lucene/java/trunk/docs/benchmarktemplate.xml lucene/java/trunk/docs/contributions.html lucene/java/trunk/docs/demo.html lucene/java/trunk/docs/demo2.html lucene/java/trunk/docs/demo3.html lucene/java/trunk/docs/demo4.html lucene/java/trunk/docs/features.html lucene/java/trunk/docs/fileformats.html lucene/java/trunk/docs/gettingstarted.html lucene/java/trunk/docs/images/ lucene/java/trunk/docs/index.html lucene/java/trunk/docs/lucene-sandbox/ lucene/java/trunk/docs/mailinglists.html lucene/java/trunk/docs/queryparsersyntax.html lucene/java/trunk/docs/resources.html lucene/java/trunk/docs/scoring.html lucene/java/trunk/docs/styles/ lucene/java/trunk/docs/systemproperties.html lucene/java/trunk/docs/whoweare.html lucene/java/trunk/xdocs/ Propchange: lucene/java/trunk/src/site/ ------------------------------------------------------------------------------ --- svn:ignore (added) +++ svn:ignore Sun Nov 26 16:00:46 2006 @@ -0,0 +1 @@ +build Added: lucene/java/trunk/src/site/forrest.properties URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/forrest.properties?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/forrest.properties (added) +++ lucene/java/trunk/src/site/forrest.properties Sun Nov 26 16:00:46 2006 @@ -0,0 +1,130 @@ +# Copyright 2002-2005 The Apache Software Foundation or its licensors, +# as applicable. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +############## +# Properties used by forrest.build.xml for building the website +# These are the defaults, un-comment them only if you need to change them. +############## + +# Prints out a summary of Forrest settings for this project +#forrest.echo=true + +# Project name (used to name .war file) +#project.name=my-project + +# Specifies name of Forrest skin to use +# See list at http://forrest.apache.org/docs/skins.html +#project.skin=pelt + +# Descriptors for plugins and skins +# comma separated list, file:// is supported +#forrest.skins.descriptors=http://forrest.apache.org/skins/skins.xml,file:///c:/myskins/skins.xml +#forrest.plugins.descriptors=http://forrest.apache.org/plugins/plugins.xml,http://forrest.apache.org/plugins/whiteboard-plugins.xml + +############## +# behavioural properties +#project.menu-scheme=tab_attributes +#project.menu-scheme=directories + +############## +# layout properties + +# Properties that can be set to override the default locations +# +# Parent properties must be set. 
This usually means uncommenting +# project.content-dir if any other property using it is uncommented + +#project.status=status.xml +#project.content-dir=src/documentation +#project.raw-content-dir=${project.content-dir}/content +#project.conf-dir=${project.content-dir}/conf +#project.sitemap-dir=${project.content-dir} +#project.xdocs-dir=${project.content-dir}/content/xdocs +#project.resources-dir=${project.content-dir}/resources +#project.stylesheets-dir=${project.resources-dir}/stylesheets +#project.images-dir=${project.resources-dir}/images +#project.schema-dir=${project.resources-dir}/schema +#project.skins-dir=${project.content-dir}/skins +#project.skinconf=${project.content-dir}/skinconf.xml +#project.lib-dir=${project.content-dir}/lib +#project.classes-dir=${project.content-dir}/classes +#project.translations-dir=${project.content-dir}/translations +project.configfile=${project.home}/src/documentation/conf/cli.xconf + +############## +# validation properties + +# This set of properties determine if validation is performed +# Values are inherited unless overridden. +# e.g. if forrest.validate=false then all others are false unless set to true. +#forrest.validate=true +#forrest.validate.xdocs=${forrest.validate} +#forrest.validate.skinconf=${forrest.validate} +#forrest.validate.sitemap=${forrest.validate} +#forrest.validate.stylesheets=${forrest.validate} +#forrest.validate.skins=${forrest.validate} +#forrest.validate.skins.stylesheets=${forrest.validate.skins} + +# *.failonerror=(true|false) - stop when an XML file is invalid +#forrest.validate.failonerror=true + +# *.excludes=(pattern) - comma-separated list of path patterns to not validate +# e.g. 
+#forrest.validate.xdocs.excludes=samples/subdir/**, samples/faq.xml +#forrest.validate.xdocs.excludes= + + +############## +# General Forrest properties + +# The URL to start crawling from +#project.start-uri=linkmap.html + +# Set logging level for messages printed to the console +# (DEBUG, INFO, WARN, ERROR, FATAL_ERROR) +#project.debuglevel=ERROR + +# Max memory to allocate to Java +#forrest.maxmemory=64m + +# Any other arguments to pass to the JVM. For example, to run on an X-less +# server, set to -Djava.awt.headless=true +#forrest.jvmargs= + +# The bugtracking URL - the issue number will be appended +#project.bugtracking-url=http://issues.apache.org/bugzilla/show_bug.cgi?id= +#project.bugtracking-url=http://issues.apache.org/jira/browse/ + +# The issues list as rss +#project.issues-rss-url= + +#I18n Property. Based on the locale request for the browser. +#If you want to use it for static site then modify the JVM system.language +# and run once per language +#project.i18n=true + +# The names of plugins that are required to build the project +# comma separated list (no spaces) +# You can request a specific version by appending "-VERSION" to the end of +# the plugin name. If you exclude a version number the latest released version +# will be used, however, be aware that this may be a development version. In +# a production environment it is recomended that you specify a known working +# version. 
+# Run "forrest available-plugins" for a list of plug-ins currently available +project.required.plugins=org.apache.forrest.plugin.output.pdf + +# Proxy configuration +# proxy.host= +# proxy.port= Propchange: lucene/java/trunk/src/site/forrest.properties ------------------------------------------------------------------------------ svn:executable = * Added: lucene/java/trunk/src/site/src/documentation/classes/CatalogManager.properties URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/classes/CatalogManager.properties?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/classes/CatalogManager.properties (added) +++ lucene/java/trunk/src/site/src/documentation/classes/CatalogManager.properties Sun Nov 26 16:00:46 2006 @@ -0,0 +1,57 @@ +# Copyright 2002-2005 The Apache Software Foundation or its licensors, +# as applicable. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#======================================================================= +# CatalogManager.properties for Catalog Entity Resolver. +# +# This is the default properties file for your project. +# This facilitates local configuration of application-specific catalogs. +# If you have defined any local catalogs, then they will be loaded +# before Forrest's core catalogs. 
+# +# See the Apache Forrest documentation: +# http://forrest.apache.org/docs/your-project.html +# http://forrest.apache.org/docs/validation.html + +# verbosity: +# The level of messages for status/debug (messages go to standard output). +# The setting here is for your own local catalogs. +# The verbosity of Forrest's core catalogs is controlled via +# main/webapp/WEB-INF/cocoon.xconf +# +# The following messages are provided ... +# 0 = none +# 1 = ? (... not sure yet) +# 2 = 1+, Loading catalog, Resolved public, Resolved system +# 3 = 2+, Catalog does not exist, resolvePublic, resolveSystem +# 10 = 3+, List all catalog entries when loading a catalog +# (Cocoon also logs the "Resolved public" messages.) +verbosity=1 + +# catalogs ... list of additional catalogs to load +# (Note that Apache Forrest will automatically load its own default catalog +# from main/webapp/resources/schema/catalog.xcat) +# Use either full pathnames or relative pathnames. +# pathname separator is always semi-colon (;) regardless of operating system +# directory separator is always slash (/) regardless of operating system +catalogs=../resources/schema/catalog.xcat + +# relative-catalogs +# If false, relative catalog URIs are made absolute with respect to the +# base URI of the CatalogManager.properties file. 
This setting only +# applies to catalog URIs obtained from the catalogs property in the +# CatalogManager.properties file +# Example: relative-catalogs=[yes|no] +relative-catalogs=no Propchange: lucene/java/trunk/src/site/src/documentation/classes/CatalogManager.properties ------------------------------------------------------------------------------ svn:executable = * Added: lucene/java/trunk/src/site/src/documentation/conf/cli.xconf URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/conf/cli.xconf?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/conf/cli.xconf (added) +++ lucene/java/trunk/src/site/src/documentation/conf/cli.xconf Sun Nov 26 16:00:46 2006 @@ -0,0 +1,321 @@ + + + + + + + + . + WEB-INF/cocoon.xconf + ../tmp/cocoon-work + ../site + + + + + + + + + + + + + + + index.html + + + + + + + */* + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Added: lucene/java/trunk/src/site/src/documentation/content/.htaccess URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/.htaccess?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/.htaccess (added) +++ lucene/java/trunk/src/site/src/documentation/content/.htaccess Sun Nov 26 16:00:46 2006 @@ -0,0 +1,3 @@ +#Forrest generates UTF-8 by default, but these httpd servers are +#ignoring the meta http-equiv charset tags +AddDefaultCharset off Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/benchmarks.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/benchmarks.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/benchmarks.xml (added) +++ 
lucene/java/trunk/src/site/src/documentation/content/xdocs/benchmarks.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,525 @@ + + +
+ Apache Lucene - Resources - Performance Benchmarks +
+ + Kelvin Tan + + + + +
Performance Benchmarks +

+ The purpose of these user-submitted performance figures is to + give current and potential users of Lucene a sense + of how well Lucene scales. If the requirements for an upcoming + project are similar to an existing benchmark, you + will also have something to work with when designing the system + architecture for the application. +

+

+ If you've conducted performance tests with Lucene, we'd + appreciate it if you would submit these figures for display + on this page. Post these figures to the lucene-user mailing list + using this + template. +

+
+ +
Benchmark Variables +

+

    +

    + Hardware Environment
    +

  • Dedicated machine for indexing: Self-explanatory + (yes/no)
  • +
  • CPU: Self-explanatory (Type, Speed and Quantity)
  • +
  • RAM: Self-explanatory
  • +
  • Drive configuration: Self-explanatory (IDE, SCSI, + RAID-1, RAID-5)
  • +

    +

    + Software environment
    +

  • Lucene Version: Self-explanatory
  • +
  • Java Version: Version of Java SDK/JRE that is run +
  • +
  • Java VM: Server/client VM, Sun VM/JRockIt
  • +
  • OS Version: Self-explanatory
  • +
  • Location of index: Is the index stored in filesystem + or database? Is it on the same server(local) or + over the network?
  • +

    +

    + Lucene indexing variables
    +

  • Number of source documents: Number of documents being + indexed
  • +
  • Total filesize of source documents: + Self-explanatory
  • +
  • Average filesize of source documents: + Self-explanatory
  • +
  • Source documents storage location: Where are the + documents being indexed located? + Filesystem, DB, http, etc.
  • +
  • File type of source documents: Types of files being + indexed, e.g. HTML files, XML files, PDF files, etc.
  • +
  • Parser(s) used, if any: Parsers used for parsing the + various files for indexing, + e.g. XML parser, HTML parser, etc.
  • +
  • Analyzer(s) used: Type of Lucene analyzer used
  • +
  • Number of fields per document: Number of Fields each + Document contains
  • +
  • Type of fields: Type of each field
  • +
  • Index persistence: Where the index is stored, e.g. + FSDirectory, SqlDirectory, etc.
  • +

    +

    + Figures
    +

  • Time taken (in ms/s as an average of at least 3 indexing + runs): Time taken to index all files
  • +
  • Time taken / 1000 docs indexed: Time taken to index + 1000 files
  • +
  • Memory consumption: Self-explanatory
  • +
  • Query speed: average time a query takes, type + of queries (e.g. simple one-term query, phrase query), + not measuring any overhead outside Lucene
  • +

    +

    + Notes
    +

  • Notes: Any comments which don't belong in the above, + special tuning/strategies, etc.
  • +

    +
+

+
+ +
User-submitted Benchmarks +

+ These benchmarks have been kindly submitted by Lucene users for + reference purposes. +

+

We make NO guarantees regarding their accuracy or + validity. +

+

We strongly recommend you conduct your own + performance benchmarks before deciding on a particular + hardware/software setup (and hopefully submit + these figures to us). +

+ +
Hamish Carpenter's benchmarks +
    +

    + Hardware Environment
    +

  • Dedicated machine for indexing: yes
  • +
  • CPU: Intel x86 P4 1.5Ghz
  • +
  • RAM: 512 DDR
  • +
  • Drive configuration: IDE 7200rpm Raid-1
  • +

    +

    + Software environment
    +

  • Lucene Version: 1.3
  • +
  • Java Version: 1.3.1 IBM JITC Enabled
  • +
  • Java VM:
  • +
  • OS Version: Debian Linux 2.4.18-686
  • +
  • Location of index: local
  • +

    +

    + Lucene indexing variables
    +

  • Number of source documents: Random generator. Set + to make 1M documents + in 2x500,000 batches.
  • +
  • Total filesize of source documents: > 1GB if + stored
  • +
  • Average filesize of source documents: 1KB
  • +
  • Source documents storage location: Filesystem
  • +
  • File type of source documents: Generated
  • +
  • Parser(s) used, if any:
  • +
  • Analyzer(s) used: Default
  • +
  • Number of fields per document: 11
  • +
  • Type of fields: 1 date, 1 id, 9 text
  • +
  • Index persistence: FSDirectory
  • +

    +

    + Figures
    +

  • Time taken (in ms/s as an average of at least 3 + indexing runs):
  • +
  • Time taken / 1000 docs indexed: 49 seconds
  • +
  • Memory consumption:
  • +

    +

    + Notes
    +

    + A windows client ran a random document generator which + created + documents based on some arrays of values and an excerpt + (approx 1kb) + from a text file of the bible (King James version).
    + These were submitted via a socket connection (open throughout + indexing process).
    + The index writer was not closed between index calls.
    + This created a 400Mb index in 23 files (after + optimization).
    +

    +

    + Query details:
    +

    +

    + Set up a threaded class to start x number of simultaneous + threads to + search the above created index. +

    +

    + Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) (Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y) +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] +

    +

    + This query counted 34000 documents and I limited the returned + documents + to 5. +

    +

    + This is using Peter Halacsy's IndexSearcherCache, slightly + modified to + be a singleton that returns cached searchers for a given + directory. This + solved an initial problem with too many files open and + running out of + Linux handles for them. +
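The caching idea described above can be sketched as follows. This is a minimal illustration only, not Peter Halacsy's IndexSearcherCache and not Lucene API: `SearcherCache`, `instance`, `get` and `fake_open` are hypothetical names, and the "searcher" is a placeholder object standing in for a real index searcher.

```python
import threading


class SearcherCache:
    """Hypothetical per-directory cache: one shared searcher per index path."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self, open_searcher):
        self._open_searcher = open_searcher  # assumed factory: path -> searcher
        self._searchers = {}
        self._lock = threading.Lock()

    @classmethod
    def instance(cls, open_searcher):
        # Singleton accessor: the first caller's factory is the one kept.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls(open_searcher)
            return cls._instance

    def get(self, directory):
        # Open each directory once; later callers reuse the same searcher,
        # so the number of open file handles stays bounded.
        with self._lock:
            if directory not in self._searchers:
                self._searchers[directory] = self._open_searcher(directory)
            return self._searchers[directory]


opened = []


def fake_open(directory):
    # Stand-in for opening a real index searcher.
    opened.append(directory)
    return object()


cache = SearcherCache.instance(fake_open)
first = cache.get("/index/a")
second = cache.get("/index/a")
```

Both calls to `get` return the same object, and the directory is opened once, which is the property that would keep the file-handle count down when many threads search the same index.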

    +
    +                                Threads|Avg Time per query (ms)
    +                                1       1009ms
    +                                2       2043ms
    +                                3       3087ms
    +                                4       4045ms
    +                                ..        .
    +                                ..        .
    +                                10      10091ms
    +                            
    +
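One way to read the table above is to convert it into aggregate throughput. The sketch below assumes each thread issues its queries back to back, which the submission does not state; `queries_per_second` is an illustrative helper, not part of the benchmark.

```python
# Average time per query (ms) by thread count, copied from the table above.
avg_ms = {1: 1009, 2: 2043, 3: 3087, 4: 4045, 10: 10091}


def queries_per_second(threads, avg_query_ms):
    # Assumption: every thread runs queries back to back, so N threads
    # complete N queries every avg_query_ms milliseconds.
    return threads * 1000.0 / avg_query_ms


throughput = {n: round(queries_per_second(n, ms), 2) for n, ms in avg_ms.items()}
print(throughput)
```

Under that assumption the throughput stays just under one query per second at every thread count, which would suggest the extra threads were queuing behind each other rather than adding capacity.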

    + I removed the two date range terms from the query and it made + a HUGE + difference in performance. With 4 threads the avg time + dropped to 900ms! +

    +

    Other query optimizations made little difference.

    +

    +
+

+ Hamish can be contacted at hamish at catalyst.net.nz. +

+
+ +
Justin Greene's benchmarks +
    +

    + Hardware Environment
    +

  • Dedicated machine for indexing: No, but nominal + usage at time of indexing.
  • +
  • CPU: Compaq Proliant 1850R/600 2 X pIII 600
  • +
  • RAM: 1GB, 256MB allocated to JVM.
  • +
  • Drive configuration: RAID 5 on Fibre Channel + Array
  • +

    +

    + Software environment
    +

  • Java Version: 1.3.1_06
  • +
  • Java VM:
  • +
  • OS Version: Winnt 4/Sp6
  • +
  • Location of index: local
  • +

    +

    + Lucene indexing variables
    +

  • Number of source documents: about 60K
  • +
  • Total filesize of source documents: 6.5GB
  • +
  • Average filesize of source documents: 100K + (6.5GB/60K documents)
  • +
  • Source documents storage location: filesystem on + NTFS
  • +
  • File type of source documents:
  • +
  • Parser(s) used, if any: Currently the only parser + used is the Quiotix html + parser.
  • +
  • Analyzer(s) used: SimpleAnalyzer
  • +
  • Number of fields per document: 8
  • +
  • Type of fields: All strings, and all are stored + and indexed.
  • +
  • Index persistence: FSDirectory
  • +

    +

    + Figures
    +

  • Time taken (in ms/s as an average of at least 3 + indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 + minutes. Note that the number + and size of documents change daily.
  • +
  • Time taken / 1000 docs indexed:
  • +
  • Memory consumption: JVM is given 256MB and uses it + all.
  • +

    +

    + Notes
    +

    + We have 10 threads reading files from the filesystem, parsing + and + analyzing them, and then pushing them onto a queue, and a single + thread popping + them from the queue and indexing. Note that we are indexing + email messages + and are storing the entire plaintext of the message in the + index. If the + message contains an attachment and we do not have a filter for + the attachment + (i.e. we do not do PDFs yet), we discard the data. +
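The pipeline described above (many reader threads, one queue, a single indexing thread) can be sketched as below. This is a minimal illustration under stated assumptions, not the submitter's code: `parse_and_analyze`, the file names, the queue size and the list standing in for the index are all hypothetical.

```python
import queue
import threading

NUM_READERS = 10   # matches the 10 reader threads described above
SENTINEL = None    # marks the end of the stream for the indexer


def parse_and_analyze(path):
    # Placeholder for reading, parsing and analyzing one message.
    return {"path": path, "text": "body of " + path}


def reader(paths, out_queue):
    for path in paths:
        out_queue.put(parse_and_analyze(path))


def indexer(in_queue, index):
    # Single consumer: the only thread that touches the index.
    while True:
        doc = in_queue.get()
        if doc is SENTINEL:
            break
        index.append(doc)


paths = ["msg%03d.eml" % i for i in range(100)]
work = queue.Queue(maxsize=50)  # bounded, so readers cannot outrun the indexer
index = []

consumer = threading.Thread(target=indexer, args=(work, index))
consumer.start()

readers = [
    threading.Thread(target=reader, args=(paths[i::NUM_READERS], work))
    for i in range(NUM_READERS)
]
for t in readers:
    t.start()
for t in readers:
    t.join()

work.put(SENTINEL)  # all readers are done; tell the indexer to stop
consumer.join()
```

Keeping a single consumer means only one thread ever writes to the index, while parsing and analysis are spread across the readers.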

    +

    +
+

+ Justin can be contacted at tvxh-lw4x at spamex.com. +

+
+ + +
Daniel Armbrust's benchmarks +

+ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed, + nor was the total index built in one shot. The index was created on several different + machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to + 1 million documents per batch. Each of these small indexes was then moved to a + much larger drive, where they were all merged together into a big index. + This process was done manually, over the course of several months, as the sources became available. +

+
    +

    + Hardware Environment
    +

  • Dedicated machine for indexing: no - The machine had moderate to low load. However, the indexing process was built single + threaded, so it only took advantage of 1 of the processors. It usually got 100% of this processor.
  • +
  • CPU: Sun Ultra 80 4 x 64 bit processors
  • +
  • RAM: 4 GB Memory
  • +
  • Drive configuration: Ultra-SCSI Wide 10000 RPM 36GB Drive
  • +

    +

    + Software environment
    +

  • Lucene Version: 1.2
  • +
  • Java Version: 1.3.1
  • +
  • Java VM:
  • +
  • OS Version: Sun 5.8 (64 bit)
  • +
  • Location of index: local
  • +

    +

    + Lucene indexing variables
    +

  • Number of source documents: 13,820,517
  • +
  • Total filesize of source documents: 87.3 GB
  • +
  • Average filesize of source documents: 6.3 KB
  • +
  • Source documents storage location: Filesystem
  • +
  • File type of source documents: XML
  • +
  • Parser(s) used, if any:
  • +
  • Analyzer(s) used: A home grown analyzer that simply removes stopwords.
  • +
  • Number of fields per document: 1 - 31
  • +
  • Type of fields: All text, though 2 of them are dates (20001205) that we filter on
  • +
  • Index persistence: FSDirectory
  • +
  • Index size: 12.5 GB
  • +

    +

    + Figures
    +

  • Time taken (in ms/s as an average of at least 3 + indexing runs): For 617271 documents, 209698 seconds (or ~2.5 days)
  • +
  • Time taken / 1000 docs indexed: 340 Seconds
  • +
  • Memory consumption: (java executed with) java -Xmx1000m -Xss8192k so + 1 GB of memory was allotted to the indexer
  • +

    +
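The figures above can be checked with a short calculation. This is only a worked example using the two numbers reported in the submission (617271 documents, 209698 seconds); the variable names are illustrative.

```python
docs = 617_271
seconds = 209_698

per_1000 = seconds / (docs / 1000)  # seconds per 1000 documents
days = seconds / 86_400             # 86,400 seconds in a day

print(round(per_1000), round(days, 1))
```

The per-1000 figure rounds to the 340 seconds reported, and the elapsed time works out to about 2.4 days, in line with the "~2.5 days" quoted.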

    + Notes
    +

    + The source documents were XML. The "indexer" opened each document one at a time, ran an + XSL transformation on it, and then proceeded to index the stream. The indexer optimized + the index every 50,000 documents (on this run) though previously, we optimized every + 300,000 documents. The performance didn't change much either way. We did no other + tuning (RAM Directories, separate process to pretransform the source material, etc.) + to make it index faster. When all of these individual indexes were built, they were + merged together into the main index. That process usually took about a day. +

    +

    +
+

+ Daniel can be contacted at Armbrust.Daniel at mayo.edu. +

+
+
Geoffrey Peddle's benchmarks +

+ I'm doing a technical evaluation of search engines + for Ariba, an enterprise application software company. + I compared Lucene to a commercial C language based + search engine which I'll refer to as vendor A. + Overall Lucene's performance was similar to vendor A + and met our application's requirements. I've + summarized our results below. +

+

+ Search scalability:
+ We ran a set of 16 queries in a single thread for 20 + iterations. We report below the times for the last 15 + iterations (i.e. after the system was warmed up). The + 4 sets of results below are for indexes of between + 50,000 and 600,000 documents. Although the + times for Lucene grew faster with document count than + those for vendor A, they were comparable. +

+
+50K  documents
+Lucene   5.2   seconds
+A        7.2
+200K
+Lucene   15.3
+A        15.2
+400K
+Lucene    28.2
+A         25.5
+600K
+Lucene    41
+A         33
+
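The claim that Lucene's times grew faster than vendor A's can be read directly off the table above. The sketch below only restates the reported totals; `growth` is an illustrative helper.

```python
# Total time in seconds for the query set, by index size (from the table above).
lucene = {50_000: 5.2, 200_000: 15.3, 400_000: 28.2, 600_000: 41.0}
vendor_a = {50_000: 7.2, 200_000: 15.2, 400_000: 25.5, 600_000: 33.0}


def growth(times):
    # Ratio of the time at the largest index to the time at the smallest.
    sizes = sorted(times)
    return times[sizes[-1]] / times[sizes[0]]


print(round(growth(lucene), 1), round(growth(vendor_a), 1))
```

The document count grows 12-fold between the smallest and largest index; over the same range Lucene's total time grows about 7.9-fold and vendor A's about 4.6-fold, so both scale better than linearly on this data.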
+

+ Individual Query times:
+ Total query times are very similar between the 2 + systems but there were larger differences when you + looked at individual queries. +

+

+ For simple queries with small result sets Vendor A was + consistently faster than Lucene. For example a + single query might take vendor A 32 thousandths of a + second and Lucene 64 thousandths of a second. Both + times are however well within acceptable response + times for our application. +

+

+ For simple queries with large result sets Vendor A was + consistently slower than Lucene. For example a + single query might take vendor A 300 thousandths of a + second and Lucene 200 thousandths of a second. + For more complex queries of the form (term1 or term2 + or term3) AND (term4 or term5 or term6) AND (term7 or + term8) the results were more divergent. For + queries with small result sets Vendor A generally had + very short response times and sometimes Lucene had + significantly larger response times. For example + Vendor A might take 16 thousandths of a second and + Lucene might take 156. I do not consider it to be + the case that Lucene's response time grew unexpectedly + but rather that Vendor A appeared to be taking + advantage of an optimization which Lucene didn't have. + (I believe there have been discussions on the dev + mailing list on complex queries of this sort.) +

+

+ Index Size:
+ For our test data the size of both indexes grew + linearly with the number of documents. Note that + these sizes are compact sizes, not maximum size during + index loading. The numbers below are from running du + -k in the directory containing the index data. The + larger numbers below for Vendor A may be because it + supports additional functionality not available in + Lucene. I think it's the constant rate of growth + rather than the absolute amount which is more + important. +

+
+50K  documents
+Lucene      45516 K
+A           63921
+200K
+Lucene      171565
+A           228370
+400K
+Lucene      345717
+A           457843
+600K
+Lucene      511338
+A           684913
+
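The "constant rate of growth" can be made concrete by dividing each size by its document count. The sketch below uses only the du -k figures from the table above; `kb_per_doc` is an illustrative helper.

```python
# Index size in KB (du -k) by document count, from the table above.
lucene_kb = {50_000: 45_516, 200_000: 171_565, 400_000: 345_717, 600_000: 511_338}
vendor_a_kb = {50_000: 63_921, 200_000: 228_370, 400_000: 457_843, 600_000: 684_913}


def kb_per_doc(sizes):
    # Average index footprint per document at each index size.
    return {n: round(kb / n, 2) for n, kb in sizes.items()}


print(kb_per_doc(lucene_kb))
print(kb_per_doc(vendor_a_kb))
```

Lucene stays between about 0.85 and 0.91 KB per document and vendor A between about 1.14 and 1.28 KB, which is consistent with the roughly linear growth described.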
+

+ Indexing Times:
+ These times are for reading the documents from our + database, processing them, inserting them into the + document search product and index compacting. Our + data has a large number of fields/attributes. For + this test I restricted Lucene to 24 attributes to + reduce the number of files created. Doing this I was + able to specify a merge width for Lucene of 60. I + found in general that Lucene indexing performance is + very sensitive to changes in the merge width. + Note also that our application does a full compaction + after inserting every 20,000 documents. These times + are just within our acceptable limits but we are + interested in alternatives to increase Lucene's + performance in this area. +

+

+

+600K documents
+Lucene       81 minutes
+A            34 minutes
+
+

+

+ (I don't have accurate results for all sizes on this + measure but believe that the indexing time for both + solutions grew essentially linearly with size. The + time to compact the index generally grew with index + size but it's a small percent of overall time at these + sizes.) +

+
    +

    + Hardware Environment
    +

  • Dedicated machine for indexing: yes
  • +
  • CPU: Dell Pentium 4 CPU 2.00 GHz, 1 CPU
  • +
  • RAM: 1 GB Memory
  • +
  • Drive configuration: Fujitsu MAM3367MP SCSI
  • +

    +

    + Software environment
    +

  • Java Version: 1.4.2_02
  • +
  • Java VM: JDK
  • +
  • OS Version: Windows XP
  • +
  • Location of index: local
  • +

    +

    + Lucene indexing variables
    +

  • Number of source documents: 600,000
  • +
  • Total filesize of source documents: from database
  • +
  • Average filesize of source documents: from database
  • +
  • Source documents storage location: from database
  • +
  • File type of source documents: XML
  • +
  • Parser(s) used, if any:
  • +
  • Analyzer(s) used: small variation on WhitespaceAnalyzer
  • +
  • Number of fields per document: 24
  • +
  • Type of fields: 1 keyword, 1 big unindexed, rest are unstored and a mix of tokenized/untokenized
  • +
  • Index persistence: FSDirectory
  • +
  • Index size: 12.5 GB
  • +

    +

    + Figures
    +

  • Time taken (in ms/s as an average of at least 3 + indexing runs): 600,000 documents in 81 minutes (du -k = 511338)
  • +
  • Time taken / 1000 docs indexed: about 8.1 seconds (123 documents/second)
  • +
  • Memory consumption: -ms256m -mx512m -Xss4m -XX:MaxPermSize=512M
  • +

    +
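The throughput figure follows directly from the totals reported earlier (600,000 documents in 81 minutes for Lucene, 34 minutes for Vendor A). A minimal sketch of the conversion, with illustrative names that are not taken from the benchmark code:

```java
import java.util.Locale;

// Hypothetical helper that converts "N documents in M minutes" into
// documents per second and seconds per 1000 documents.
class IndexingThroughput {
    static double docsPerSecond(long docs, double minutes) {
        return docs / (minutes * 60.0);
    }

    static double secondsPerThousandDocs(long docs, double minutes) {
        return (minutes * 60.0) / (docs / 1000.0);
    }

    public static void main(String[] args) {
        long docs = 600_000;
        // Lucene: 81 minutes -> about 123.5 docs/s, 8.1 s per 1000 docs
        System.out.println(String.format(Locale.ROOT, "Lucene: %.1f docs/s, %.1f s per 1000 docs",
                docsPerSecond(docs, 81), secondsPerThousandDocs(docs, 81)));
        // Vendor A: 34 minutes -> about 294.1 docs/s, 3.4 s per 1000 docs
        System.out.println(String.format(Locale.ROOT, "A:      %.1f docs/s, %.1f s per 1000 docs",
                docsPerSecond(docs, 34), secondsPerThousandDocs(docs, 34)));
    }
}
```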

    + Notes
    +

    +

  • merge width of 60
  • +
  • did a compact every 20,000 documents
  • +

    +

    +
+
+
+ + +
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/contributions.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/contributions.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/contributions.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/contributions.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,327 @@ + + +
+ + Apache Lucene - Contributions + +
+ + + Peter Carlson + + + +
+ Overview +

This page lists external Lucene resources. If you have + written something that should be included, please post all + relevant information to one of the mailing lists. Nothing + listed here is directly supported by the Lucene + developers, so if you encounter any problems with any of + this software, please use the author's contact information + to get help.

+

If you are looking for information on contributing patches or other improvements to Lucene, see + How To Contribute on the Lucene Wiki.

+
+ +
+ Lucene Tools +

+ Software that works with Lucene indices. +

+
Luke + + + + + + + + + +
+ URL + + + http://www.getopt.org/luke/ + +
+ author + + Andrzej Bialecki +
+
+
+ LIMO (Lucene Index Monitor) + + + + + + + + + +
+ URL + + + http://limo.sf.net/ + +
+ author + + Julien Nioche +
+
+
+ +
+ Lucene Document Converters +

+ Lucene requires information you want to index to be + converted into a Document class. Here are + contributions for various solutions that convert different + content types to Lucene's Document classes. +

+
+ XML Document #1 + + + + + + + + + +
+ URL + + + http://marc.theaimsgroup.com/?l=lucene-dev&m=100723333506246&w=2 + +
+ author + + Philip Ogren - ogren@mayo.edu +
+
+
+ XML Document #2 + + + + + + + + + +
+ URL + + + http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00346.html + +
+ author + + Peter Carlson - carlson@bookandhammer.com +
+
+
+ PDF Box + + + + + + + + + +
+ URL + + + http://www.pdfbox.org/ + +
+ author + + Ben Litchfield - ben@csh.rit.edu +
+
+
+ XPDF - PDF Document Conversion + + + + + + + + + +
+ URL + + + http://www.foolabs.com/xpdf + +
+ author + + N/A +
+
+
+ PDFTextStream -- PDF text and metadata extraction + + + + + + + + + +
+ URL + + + http://snowtide.com + +
+ author + + N/A +
+
+
+ PJ Classic & PJ Professional - PDF Document Conversion + + + + + + + + + +
+ URL + + + http://www.etymon.com/ + +
+ author + + N/A +
+
+
+ +
+ Miscellaneous +

+

+
+ Arabic Analyzer for Java + + + + + + + + + +
+ URL + + + http://savannah.nongnu.org/projects/aramorph + +
+ author + + Pierrick Brihaye +
+
+
+ Phonetix + + + + + + + + + +
+ URL + + + http://www.companywebstore.de/tangentum/mirror/en/products/phonetix/index.html + +
+ author + + tangentum technologies +
+
+
+ ejIndex - JBoss MBean for Lucene +

+

+ + + + + + + + + +
+ URL + + + http://ejindex.sourceforge.net/ + +
+ author + + Andy Scholz +
+
+
+ JavaCC + + + + + + + + + +
+ URL + + + https://javacc.dev.java.net/ + +
+ author + + Sun Microsystems (java.net) +
+
+
+ +
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/demo.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/demo.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/demo.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/demo.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,78 @@ + + +
+ + Apache Lucene - Building and Installing the Basic Demo + +
+ +Andrew C. Oliver + + + +
About this Document +

+This document is intended as a "getting started" guide to using and running the Lucene demos. +It walks you through some basic installation and configuration. +

+
+ + +
About the Demos +

+The Lucene command-line demo code consists of two applications that demonstrate various +functionalities of Lucene and how one should go about adding Lucene to their applications. +

+
+ +
Setting your CLASSPATH +

+First, you should download the +latest Lucene distribution and then extract it to a working directory. Alternatively, you can check out the sources from +Subversion, and then run ant war-demo to generate the JARs and WARs. +

+

+You should see the Lucene JAR file in the directory you created when you extracted the archive. It +should be named something like lucene-core-{version}.jar. You should also see a file +called lucene-demos-{version}.jar. If you checked out the sources from Subversion then +the JARs are located under the build subdirectory (after running ant +successfully). Put both of these files in your Java CLASSPATH. +

+
+ +
Indexing Files +

+Once you've gotten this far you're probably itching to go. Let's build an index! Assuming +you've set your CLASSPATH correctly, just type: + +

+    java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src
+
+ +This will produce a subdirectory called index which will contain an index of all of the +Lucene source code. +

+

+To search the index type: + +

+    java org.apache.lucene.demo.SearchFiles
+
+ +You'll be prompted for a query. Type in a swear word and press the enter key. You'll see that the +Lucene developers are very well mannered and get no results. Now try entering the word "vector". +That should return a whole bunch of documents. The results will page at every tenth result and ask +you whether you want more results. +

+
+ +
About the code... +

+read on>>> +

+
+ + +
+ Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/demo2.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/demo2.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/demo2.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/demo2.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,139 @@ + + +
+ + Apache Lucene - Basic Demo Sources Walk-through + +
+ +Andrew C. Oliver + + + +
About the Code +

+In this section we walk through the sources behind the command-line Lucene demo: where to find them, +their parts and their function. This section is intended for Java developers wishing to understand +how to use Lucene in their applications. +

+
+ + +
Location of the source + +

+Relative to the directory created when you extracted Lucene or retrieved it from Subversion, you +should see a directory called src which in turn contains a directory called +demo. This is the root for all of the Lucene demos. Under this directory is +org/apache/lucene/demo. This is where all the Java sources for the demos live. +

+ +

+Within this directory you should see the IndexFiles.java class we executed earlier. +Bring it up in vi or your editor of choice and let's take a look at it. +

+ +
+ +
IndexFiles + +

+As we discussed in the previous walk-through, the IndexFiles class creates a Lucene +Index. Let's take a look at how it does this. +

+ +

+The first substantial thing the main function does is instantiate IndexWriter. It passes the string +"index" and a new instance of a class called StandardAnalyzer. +The "index" string is the name of the filesystem directory where all index information +should be stored. Because we're not passing a full path, this will be created as a subdirectory of +the current working directory (if it does not already exist). On some platforms, it may be created +in other directories (such as the user's home directory). +

+ +

+The IndexWriter is the main +class responsible for creating indices. To use it you must instantiate it with a path that it can +write the index into. If this path does not exist it will first create it. Otherwise it will +refresh the index at that path. You can also create an index using one of the subclasses of Directory. In any case, you must also pass an +instance of org.apache.lucene.analysis.Analyzer. +

+ +

+The particular Analyzer we +are using, StandardAnalyzer, is +little more than a standard Java Tokenizer, converting all strings to lowercase and filtering out +useless words and characters from the index. By useless words and characters I mean common language +words such as articles (a, an, the, etc.) and other strings that would be useless for searching +(e.g. 's). It should be noted that there are different rules for every language, and you +should use the proper analyzer for each. Lucene currently provides Analyzers for a number of +different languages (see the *Analyzer.java sources under contrib/analyzers/src/java/org/apache/lucene/analysis). +

+ +
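To make the idea of lowercasing and stop-word filtering concrete, here is a toy sketch in plain Java. It is not StandardAnalyzer and does not use any Lucene API; the class name, the three-word stop list and the splitting rule are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: lowercases text, splits on anything that is not
// a letter or digit, and drops a tiny hypothetical stop-word list.
class ToyAnalyzer {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            // split() can yield an empty leading token; skip it and any stop word
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Vector of an Index")); // prints [vector, of, index]
    }
}
```

The real analyzers apply language-specific rules that are considerably more involved than this, which is why choosing the right one per language matters.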

+Looking further down in the file, you should see the indexDocs() code. This recursive +function simply crawls the directories and uses FileDocument to create Document objects. The Document is simply a data object to +represent the content in the file as well as its creation time and location. These instances are +added to the indexWriter. Take a look inside FileDocument. It's not particularly +complicated. It just adds fields to the Document. +

+ +
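The recursive crawl described above can be sketched without Lucene at all. The following is an assumption-laden outline of the traversal only: the class and method names are hypothetical, and where the demo would build a Document and add it to the writer, this sketch just collects the file.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a recursive directory walk in the spirit of
// indexDocs(); it gathers readable files instead of indexing them.
class CrawlSketch {
    static void collect(File file, List<File> out) {
        if (!file.canRead()) {
            return; // skip anything we cannot read
        }
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children != null) { // null signals an I/O error
                Arrays.sort(children); // deterministic order
                for (File child : children) {
                    collect(child, out);
                }
            }
        } else {
            out.add(file); // the demo would create a Document here
        }
    }

    public static void main(String[] args) {
        List<File> found = new ArrayList<>();
        collect(new File(args.length > 0 ? args[0] : "."), found);
        System.out.println(found.size() + " files found");
    }
}
```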

+As you can see there isn't much to creating an index. The devil is in the details. You may also +wish to examine the other samples in this directory, particularly the IndexHTML class. It is a bit more +complex but builds upon this example. +

+ +
+ +
Searching Files + +

+The SearchFiles class is +quite simple. It primarily collaborates with an IndexSearcher, StandardAnalyzer +(which is used in the IndexFiles class as well) and a +QueryParser. The +query parser is constructed with an analyzer used to interpret your query text in the same way the +documents are interpreted: finding the end of words and removing useless words like 'a', 'an' and +'the'. The Query object contains +the results from the QueryParser which is passed to +the searcher. Note that it's also possible to programmatically construct a rich Query object without using the query +parser. The query parser just enables decoding the Lucene query +syntax into the corresponding Query object. The searcher results are +returned in a collection of Documents called Hits which is then iterated through and +displayed to the user. +

+ +
+ +
The Web example... + +

+read on>>> +

+ +
+ + +
+ Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/demo3.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/demo3.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/demo3.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/demo3.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,90 @@ + + + +
+ + Apache Lucene - Building and Installing the Basic Demo + +
+ +Andrew C. Oliver + + + +
About this Document +

+This document is intended as a "getting started" guide to installing and running the Lucene +web application demo. This guide assumes that you have read the information in the previous two +examples. We'll use Tomcat as our reference web container. These demos should work with nearly any +container, but you may have to adapt them appropriately. +

+
+ + +
About the Demos +

+The Lucene Web Application demo is a template web application intended for deployment on Tomcat or a +similar web container. It's NOT designed as a "best practices" implementation by ANY means. It's +more of a "hello world" type Lucene Web App. The purpose of this application is to demonstrate +Lucene. With that being said, it should be relatively simple to create a small searchable website +in Tomcat or a similar application server. +

+
+ +
Indexing Files +

Once you've gotten this far you're probably itching to go. Let's start by creating the index +you'll need for the web examples. Since you've already set your CLASSPATH in the previous examples, +all you need to do is type: + +

+    java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..
+
+ +You'll need to do this from any subdirectory of your {tomcat}/webapps directory +(make sure you don't leave off the .. or you'll get a null pointer exception). +{index-dir} should be a directory that Tomcat has permission to read and write, but that is +outside of a web-accessible context. By default the webapp is configured to look in +/opt/lucene/index for this index. +

+
+ +
Deploying the Demos +

Located in your distribution directory you should see a war file called +luceneweb.war. If you're working with a Subversion checkout, this will be under the +build subdirectory. Copy this to your {tomcat-home}/webapps directory. +You may need to restart Tomcat.

+ +
Configuration +

From your Tomcat directory look in the webapps/luceneweb subdirectory. If it's not +present, try browsing to http://localhost:8080/luceneweb (which causes Tomcat to deploy +the webapp), then look again. Edit a file called configuration.jsp. Ensure that the +indexLocation is equal to the location you used for your index. You may also customize +the appTitle and appFooter strings as you see fit. Once you have finished +altering the configuration you may need to restart Tomcat. You may also wish to update the war file +by typing jar -uf luceneweb.war configuration.jsp from the luceneweb +subdirectory. (The -u option is not available in all versions of jar. In this case recreate the +war file). +

+
+ +
Running the Demos +

Now you're ready to roll. In your browser, set the URL to +http://localhost:8080/luceneweb, enter test and the number of items per +page, and press search.

+

You should now be looking either at a number of results (provided you didn't erase the Tomcat +examples) or nothing. If you get an error regarding opening the index, then you probably set the +path in configuration.jsp incorrectly or Tomcat doesn't have permissions to the index +(or you skipped the step of creating it). Try other search terms. Depending on the number of items +per page you set and results returned, there may be a link at the bottom that says More +Results>>; clicking it takes you to subsequent pages.

+ +
About the code... +

+If you want to know more about how this web app works or how to customize it then read on>>>. +

+
+ + +
+