lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "LuceneFAQ" by RobertMuir
Date Thu, 03 Nov 2011 23:24:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "LuceneFAQ" page has been changed by RobertMuir:
http://wiki.apache.org/lucene-java/LuceneFAQ?action=diff&rev1=150&rev2=151

Comment:
die optimize die

   * Always make sure that you ''explicitly'' close all file handles you open, especially
in case of errors. Use a try/catch/finally block to open the files, i.e. open them in the
try block, close them in the finally block. Remember that Java doesn't have destructors, so
don't close file handles in a finalize method -- this method is not guaranteed to be executed.
   * Use the compound file format (it's activated by default starting with Lucene 1.4) by
calling  [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)|IndexWriter's
setUseCompoundFile(true)]]
   * Don't set [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor|IndexWriter's
mergeFactor]] to large values. Large values speed up indexing but increase the number of files
that need to be opened simultaneously.
-  * If the exception occurs during searching, optimize your index calling  [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#optimize()|IndexWriter's
optimize()]] method after indexing is finished.
   * Make sure you only open one IndexSearcher, and share it among all of the threads that
are doing searches -- this is safe, and it will minimize the number of files that are open
concurrently.
   * Try to increase the number of files that can be opened simultaneously. On Linux using
bash this can be done by calling `ulimit -n <number>`.
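The try/finally pattern from the first bullet can be sketched with plain JDK I/O (no Lucene classes involved; the file name in the usage is hypothetical):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SafeRead {
    // Read the first line of a file, guaranteeing the handle is closed
    // even if an exception is thrown while reading. Java has no
    // destructors, so the finally block is the reliable place to close.
    public static String firstLine(String path) throws IOException {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new FileReader(path));
            return reader.readLine();
        } finally {
            if (reader != null) {
                reader.close();   // always runs, even on error
            }
        }
    }
}
```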
  
@@ -194, +193 @@

  
  In other words, the number returned by `maxDoc()` does not necessarily match the actual
number of undeleted documents in the index.
  
- Deleted documents do not get removed from the index immediately, unless you call `optimize()`.
+ Deleted documents are not removed from the index immediately; they remain (marked as deleted) until their segments are merged away.
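The relationship between `maxDoc()` and the live document count can be modeled with a bit set of deletions (a simplified conceptual sketch, not Lucene's actual data structures):

```java
import java.util.BitSet;

public class SegmentModel {
    private int maxDoc = 0;                    // docs ever added, incl. deleted
    private final BitSet deleted = new BitSet();

    public int addDocument() { return maxDoc++; }          // returns the doc id
    public void deleteDocument(int id) { deleted.set(id); } // only marks it
    public int maxDoc() { return maxDoc; }
    public int numDocs() { return maxDoc - deleted.cardinality(); }

    // A merge rewrites the data without the deleted documents,
    // renumbering the survivors; only then is the space reclaimed.
    public void merge() {
        maxDoc = numDocs();
        deleted.clear();
    }
}
```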
  
  ==== Is there a way to get a text summary of an indexed document with Lucene (a.k.a. a "snippet"
or "fragment") to display along with the search result? ====
  You need to store the documents' summary in the index (use Field.Store.YES when creating
that field) and then use the Highlighter from the contrib area (distributed with Lucene since
version 1.9 as "lucene-highlighter-(version).jar"). It's important to use a rewritten query
as the input for the highlighter, i.e. call rewrite() on the query. Otherwise simple queries
will work but prefix queries etc will not be highlighted.
  
  For Lucene < 1.9, you can also get the "highlighter-dev.jar" from http://www.lucenebook.com/LuceneInAction.zip.
See http://www.gossamer-threads.com/lists/lucene/java-user/31595 for a discussion of this.
- 
- ==== Can I search an index while it is being optimized? ====
- Yes, an index can be searched and optimized simultaneously.
  
  ==== Can I cache search results with Lucene? ====
  Lucene does come with a simple cache mechanism, if you use [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Filter.html|Lucene
Filters]] . The classes to look at are [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/CachingWrapperFilter.html|CachingWrapperFilter]]
and [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/QueryFilter.html|QueryFilter]].
@@ -343, +339 @@

  
  So with the default `mergeFactor` set to 10 and `maxMergeDocs` set to 7M, Lucene will generate
a series of 1M-document indexes, since merging 10 of these would exceed the maximum.
  
- A slightly more complex solution:
- 
- You could further minimize the number of segments if, when you've added 7M documents, optimize
the index and start a new index.  Then use `MultiSearcher` to search the indexes.
- 
- An even more complex and optimal solution:
- 
- Write a version of `FSDirectory` that, when a file exceeds 2GB, creates a subdirectory and
represents the file as a series of files.
- 
  ==== Why is it important to use the same analyzer type during indexing and search? ====
  The analyzer controls how the text is broken into terms which are then used to index the
document. If you are using an analyzer of one type to index and an analyzer of a different
type to parse the search query, it is possible that the same word will be mapped to two different
terms and this will result in missing or false hits.
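A toy illustration of the mismatch, using plain string handling in place of real Analyzer classes: if the index-time analyzer lowercases but the query-time one does not, a query term like "Lucene" will never match the indexed term "lucene".

```java
import java.util.Arrays;
import java.util.List;

public class AnalyzerMismatch {
    // Stand-in for an analyzer that lowercases while tokenizing.
    public static List<String> lowercasing(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }
    // Stand-in for an analyzer that keeps tokens verbatim.
    public static List<String> verbatim(String text) {
        return Arrays.asList(text.split("\\s+"));
    }
}
```

Indexing "Apache Lucene" with the first analyzer produces the term "lucene"; analyzing the same query text with the second produces "Lucene", so the terms never match and the hit is missed.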
  
@@ -358, +346 @@

  
  Also be careful with Fields that are not tokenized (like Keywords). During indexing, the
Analyzer won't be called for these fields, but at search time the !QueryParser can't know this
and will pass all search strings through the selected Analyzer.  Usually searches for Keywords
are constructed in code, but during development it can be handy to use general purpose tools
(e.g. Luke) to examine your index.  Those tools won't know which fields are tokenized either.
 In the contrib/analyzers area there's a !KeywordTokenizer with an example !KeywordAnalyzer
for cases like this.
  
- ==== What is index optimization and when should I use it? ====
- The !IndexWriter class supports an optimize() method that compacts the index database and
speeds up queries. You may want to use this method after performing a complete indexing of
your document set or after incremental updates of the index. If your incremental update adds
documents frequently, you want to perform the optimization only once in a while to avoid the
extra overhead of the optimization.
- 
  ==== What are Segments? ====
- The index database is composed of 'segments' each stored in a separate file. When you add
documents to the index, new segments may be created. You can compact the database and reduce
the number of segments by optimizing it (see a separate question regarding index optimization).
+ The index database is composed of 'segments' each stored in a separate file. When you add
documents to the index, new segments may be created. These are periodically merged together.
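The merge behavior can be sketched with a toy logarithmic merge policy (a simplified model under assumed rules, not Lucene's actual MergePolicy code): each added document starts as a tiny segment, and whenever `mergeFactor` equal-sized segments accumulate they are merged into one larger segment, so the segment count stays proportional to the log of the document count.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSketch {
    // Returns the list of segment sizes after adding numDocs documents
    // one at a time under a toy logarithmic merge policy.
    public static List<Integer> addDocs(int numDocs, int mergeFactor) {
        List<Integer> segs = new ArrayList<>();
        for (int i = 0; i < numDocs; i++) {
            segs.add(1);  // each new doc starts as a one-doc segment
            boolean merged = true;
            while (merged) {
                merged = false;
                int n = segs.size();
                if (n >= mergeFactor) {
                    int size = segs.get(n - 1);
                    boolean allEqual = true;
                    for (int j = n - mergeFactor; j < n; j++) {
                        if (!segs.get(j).equals(size)) { allEqual = false; break; }
                    }
                    if (allEqual) {
                        // merge the trailing run into one bigger segment
                        for (int j = 0; j < mergeFactor; j++) {
                            segs.remove(segs.size() - 1);
                        }
                        segs.add(size * mergeFactor);
                        merged = true;
                    }
                }
            }
        }
        return segs;
    }
}
```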
  
  ==== Is Lucene index database platform independent? ====
  Yes, you can copy a Lucene index directory from one platform to another and it will work
just as well.
@@ -377, +362 @@

  Yes, `IndexWriter.addIndexes(Directory[])` method is thread safe (it is a `synchronized`
method). !IndexWriter in general is thread safe, i.e. you should use the same !IndexWriter
object from all of your threads. Actually it's impossible to use more than one !IndexWriter
for the same index directory, as this will lead to an exception trying to create the lock
file.
  
  ==== When is it possible for document IDs to change? ====
- Documents are only re-numbered after there have been deletions.  Once there have been deletions,
renumbering may be triggered by any document addition or index optimization.  Once an index
is optimized, no renumbering will be performed until more deletions are made.
+ Documents can be re-numbered by Lucene at any time; renumbering typically happens when segments containing deletions are merged.
  
- If you require a persistent document id that survives deletions, then add it as a field
to your documents.
+ If you require a persistent document id, then add it as a field to your documents.
  
  ==== What is the purpose of write.lock file, when is it used, and by which classes? ====
  The write.lock is used to keep processes from concurrently attempting to modify an index.
@@ -405, +390 @@

  If you are certain that a lock file is not in use, you can delete it manually.  You should
also look at the methods "[[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#isLocked(org.apache.lucene.store.Directory)|IndexReader.isLocked]]"
and "[[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#unlock(org.apache.lucene.store.Directory)|IndexReader.unlock]]"
if you are interested in writing recovery code that can remove locks automatically.
  
  ==== Is there a maximum number of segment infos whose summary (name and document count)
is stored in the segments file? ====
- All segments in the index are listed in the segments file.  There is no hard limit. For
an un-optimized index it is proportional to the log of the number of documents in the index.
An optimized index contains a single segment.
+ All segments in the index are listed in the segments file.  There is no hard limit. For
a normal index it is proportional to the log of the number of documents in the index.
- 
- ==== What happens when I open an IndexWriter, optimize the index, and then close the IndexWriter?
 Which files will be added or modified? ====
- All of the segments are merged into a single new segment file. If the index was empty to
begin with, no segments will be created, only the `segments` file.
- 
- ==== If I decide not to optimize the index, when will the deleted documents actually get
deleted? ====
- Documents that are deleted are marked as deleted.  However, the space they consume in the
index does not get reclaimed until the index is optimized.  That space will also eventually
be reclaimed as more documents are added to the index, even if the index does not get optimized.
  
  ==== How do I update a document or a set of documents that are already indexed? ====
  There is no direct update procedure in Lucene. To update an index incrementally you must
first '''delete''' the documents that were updated, and '''then re-add''' them to the index.
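Conceptually, the delete-then-re-add cycle behaves like replacing entries keyed by your own persistent id field (a JDK-only sketch; Lucene's actual calls would be a delete by term on the id field followed by addDocument):

```java
import java.util.ArrayList;
import java.util.List;

public class UpdateSketch {
    // Each entry models an indexed document carrying its own id field.
    static class Doc {
        final String id; final String body;
        Doc(String id, String body) { this.id = id; this.body = body; }
    }
    private final List<Doc> index = new ArrayList<>();

    public void add(String id, String body) { index.add(new Doc(id, body)); }

    // There is no in-place update: delete every document whose id
    // field matches, then re-add the new version.
    public void update(String id, String body) {
        index.removeIf(d -> d.id.equals(id));  // delete step
        add(id, body);                          // then re-add
    }
    public int count(String id) {
        int n = 0;
        for (Doc d : index) if (d.id.equals(id)) n++;
        return n;
    }
}
```

Skipping the delete step and simply re-adding would leave both versions in the index.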
@@ -524, +503 @@

  
  Note that the article uses an older version of apache lucene. For parsing the java source
files and extracting that information, the [[http://help.eclipse.org/help33/topic/org.eclipse.jdt.doc.isv/reference/api/org/eclipse/jdt/core/dom/ASTParser.html|ASTParser]]
of the [[http://www.eclipse.org/jdt/|eclipse java development tools]] is used.
  
- ==== If I use a compound file-style index, can I still optimize my index? ====
- Yes.  Each .cfs file created in the compound file-style index represents a single segment,
which means you can still merge multiple segments into a single segment by optimizing the
index.
- 
  ==== What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]),
besides them taking different arguments? ====
  When merging lots of indexes (more than the mergeFactor), the Directory-based method will
use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once,
while the !IndexReader-based method requires that all indexes be open when passed.
  
@@ -535, +511 @@

  ==== Can I use Lucene to index text in Chinese, Japanese, Korean, and other multi-byte character
sets? ====
  Yes, you can.  Lucene is not limited to English, nor any other language.  To index text
properly, you need to use an Analyzer appropriate for the language of the text you are indexing.
 Lucene's default Analyzers work well for English.  There are a number of other Analyzers
in [[http://lucene.apache.org/java/docs/lucene-sandbox/|Lucene Sandbox]], including those
for Chinese, Japanese, and Korean.
  
- ==== Why do I have a deletable file (and old segment files remain) after running optimize?
====
+ ==== Why do I have a deletable file (and old segment files remain) after merging? ====
- This is normal behavior on Windows whenever you also have readers (IndexReaders or IndexSearchers)
open against the index you are optimizing.  Lucene tries to remove old segments files once
they have been merged (optimized).  However, because Windows does not allow removing files
that are open for reading, Lucene catches an IOException deleting these files and and then
records these pending deletable files into the "deletable" file.  On the next segments merge,
which happens with explicit optimize() or close() calls and also whenever the IndexWriter
flushes its internal RAMDirectory to disk (every IndexWriter.DEFAULT_MAX_BUFFERED_DOCS (default
10) addDocuments), Lucene will try again to delete these files (and additional ones) and any
that still fail will be rewritten to the deletable file.
+ This is normal behavior on Windows whenever you also have readers (IndexReaders or IndexSearchers)
open against the index being modified.  Lucene tries to remove old segment files once they have
been merged.  However, because Windows does not allow removing files that are open for reading,
Lucene catches an IOException when deleting these files and then records these pending deletable
files into the "deletable" file.  On the next segments merge, Lucene will try again to delete
these files (and any additional ones), and any that still fail will be rewritten to the
deletable file.
  
  Note that as of 2.1 the deletable file is no longer used.  Instead, Lucene computes which
files are no longer referenced by the index and removes them whenever a writer is created.
  
