lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "LuceneFAQ" by AlexLambert
Date Wed, 26 Jan 2011 23:21:38 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "LuceneFAQ" page has been changed by AlexLambert.
The comment on this change is: finding the referenced mailing list post.
http://wiki.apache.org/lucene-java/LuceneFAQ?action=diff&rev1=147&rev2=148

--------------------------------------------------

  This is the official Lucene FAQ.
  
- If you have a question about using Java Lucene, please do not add it directly to this FAQ.  Join the [[http://lucene.apache.org/java/docs/mailinglists.html|Java User mailing list]] and email your question there.  
+ If you have a question about using Java Lucene, please do not add it directly to this FAQ.  Join the [[http://lucene.apache.org/java/docs/mailinglists.html|Java User mailing list]] and email your question there.
  
  '''''Questions should only be added to this Wiki page when they already have an answer that can be added at the same time.'''''
  
- 
  <<TableOfContents>>
  
  == Lucene FAQ ==
- 
  === General ===
- 
- 
  ==== How do I start using Lucene? ====
- 
  Lucene has no external dependencies, so just add lucene-core-x.y-dev.jar to your development environment's classpath. After that,
  
   * read the [[http://lucene.apache.org/java/2_4_1/api/overview-summary.html#overview_description|Javadoc introduction]]
@@ -25, +20 @@

  If you think Lucene is too low-level for you, you might want to consider using [[http://lucene.apache.org/solr/|Solr]], which usually requires less Java programming.
  
  ==== Are there any mailing lists available? ====
- 
  There's a user list and a developer list, both available at http://lucene.apache.org/java/docs/mailinglists.html .
  
- 
  ==== What Java version is required to run Lucene? ====
- 
  Lucene >= 1.9 requires Java 1.4. Lucene 1.4 will run with JDK 1.3 and up but requires at least JDK 1.4 to compile.
  
- 
  ==== Will Lucene work with my Java application? ====
- 
  Yes, Lucene is 100% pure Java and has no external dependencies.
  
- 
  ==== How can I get the latest greatest development code? ====
- 
  See SourceRepository
  
  ==== Where can I get the javadocs for the org.apache.lucene classes? ====
- 
  The docs for all the classes are available online at http://lucene.apache.org/java/docs/api/index.html. In addition, they are a part of the standard distribution, and you can always recreate them by running `ant javadocs`.
  
- 
  ==== Where does the name Lucene come from? ====
- 
  Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.
  
- 
  ==== Are there any alternatives to Lucene? ====
- 
  Besides commercial products, which we don't know much about, there's also [[http://www.egothor.org|Egothor]]. Also check the list of [[http://wiki.apache.org/jakarta-lucene/LuceneImplementations|Lucene implementations]].
  
- 
  ==== Does Lucene have a web crawler? ====
- 
  No, but check out [[http://lucene.apache.org/nutch/|Nutch]] and the [[http://java-source.net/open-source/crawlers|list of Open Source Crawlers in Java]].
  
- 
  ==== Why am I getting an IOException that says "Too many open files"? ====
- 
  The number of files that can be opened simultaneously is a system-wide limitation of your operating system. Lucene can contribute to this problem because it may open a large number of files, depending on how you use it, but the cause might also lie elsewhere.
  
   * Always make sure that you ''explicitly'' close all file handles you open, especially in case of errors. Use a try/catch/finally block to open the files, i.e. open them in the try block, close them in the finally block. Remember that Java doesn't have destructors, so don't close file handles in a finalize method -- this method is not guaranteed to be executed.
   * Use the compound file format (it's activated by default starting with Lucene 1.4) by calling  [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)|IndexWriter's setUseCompoundFile(true)]]
   * Don't set [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor|IndexWriter's mergeFactor]] to large values. Large values speed up indexing but increase the number of files that need to be opened simultaneously.
   * If the exception occurs during searching, optimize your index calling  [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#optimize()|IndexWriter's optimize()]] method after indexing is finished.
-  * Make sure you only open one Index``Searcher, and share it among all of the threads that are doing searches -- this is safe, and it will minimize the number of files that are open concurently.
+  * Make sure you only open one IndexSearcher, and share it among all of the threads that are doing searches -- this is safe, and it will minimize the number of files that are open concurrently.
   * Try to increase the number of files that can be opened simultaneously. On Linux using bash this can be done by calling `ulimit -n <number>`.
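
The try/finally pattern from the first bullet can be sketched in plain Java (file handling only; the class and file names are illustrative):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class SafeRead {
    // Read the first line of a file, guaranteeing the handle is closed
    // even if an IOException is thrown mid-read.
    static String firstLine(File f) throws IOException {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new FileReader(f));
            return reader.readLine();
        } finally {
            if (reader != null) {
                reader.close();   // runs on both success and failure
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("faq", ".txt");
        FileWriter w = new FileWriter(tmp);
        try { w.write("hello\n"); } finally { w.close(); }
        System.out.println(firstLine(tmp));   // prints "hello"
        tmp.delete();
    }
}
```

The same discipline applies to Lucene's IndexReader, IndexWriter and IndexSearcher: close them in a finally block, never in a finalizer.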
  
  ==== When I compile Lucene x.y.z from source, the version number in the jar file name and MANIFEST.MF is different. What's up with that? ====
- 
  This is intentional. Only the jar files produced by the Lucene release manager will have the exact release number. Any other builds will have a different release number in order to help differentiate them from the code produced by the release process. Feel free to adjust.
  
  ==== How do I contribute an improvement? ====
- 
- Please follow all of [[HowToContribute| these steps to submit a Lucene patch]].
+ Please follow all of [[HowToContribute|these steps to submit a Lucene patch]].
  
  ==== Why hasn't patch FOO been committed? ====
- 
  [[http://www.apache.org/foundation/how-it-works.html#committers|Committers]] are at their own discretion to decide what patches are suitable for being committed.  Generally speaking, committers are encouraged to be conservative about what patches they commit.  By committing code into the code base, a committer vouches for the quality of that patch.  Any problems that ensue are, to some degree, the responsibility of that committer.  If a committer does not feel comfortable making changes to particular sections of the code base, they may wish to consult (or defer to) a more senior committer.
  
  The best way to encourage committers to commit a particular patch is to make it easy to apply. At a minimum it should apply easily to trunk and pass all unit tests.  It should confine itself to a single issue: changing as little as possible; adding as little as possible.  The patch should include new unit tests which demonstrate the bug the patch fixes (or the new functionality the patch adds).  The case is stronger if others report to have successfully applied the patch and found it useful.
@@ -91, +67 @@

  If one feels a patch is neglected one should be persistent, polite and patient.
  
  ==== What are the backwards compatibility commitments? ====
- 
- Here are the [[BackwardsCompatibility| compatibility commitments]].
+ Here are the [[BackwardsCompatibility|compatibility commitments]].
  
  ==== How do I get code written for Lucene 1.4.x to work with Lucene 2.x? ====
- 
  The upgrade path for Lucene 2.0 was designed around the notion of clear deprecation warnings.  Any code designed to use the APIs in Lucene 1.4.x should compile/function with Lucene 1.9 -- however many compile time deprecation warnings will be generated identifying methods that should no longer be used, and what new methods should be used instead.
  
  If you have code that worked with Lucene 1.4.x, and you want to "port" it to Lucene 2.x you should start by downloading the [[http://www.apache.org/dyn/closer.cgi/lucene/java/archive|1.9 release of Lucene]], and compile the code against it.  Make sure deprecation warnings are turned on in your development environment, and gradually change your code until all deprecation warnings go away (the !DateField class is an exception, it has not been removed in Lucene 2.0 yet).
@@ -109, +83 @@

   1. Describe your problem, giving details about how you are using Lucene
   1. What version of Lucene are you using?  What JDK?  Can you upgrade to the latest?
   1. Make sure it truly is a Lucene problem.  That is, isolate the problem and/or profile your application.
-  1. Search the java-user and java-dev Mailing lists, see [[http://lucene.apache.org/java/docs/mailinglists.html]]
+  1. Search the java-user and java-dev mailing lists; see http://lucene.apache.org/java/docs/mailinglists.html
  
  ==== What does l.a.o and o.a.l.xxxx stand for? ====
- 
  l.a.o is shorthand for lucene.apache.org (i.e. the website)
  
  o.a.l.xxxx is shorthand for org.apache.lucene.xxxx (i.e. the java package namespace)
  
  ==== What is the difference between field (or document) boosting and query boosting? ====
+ Index time field boosts (field.setBoost(boost)) are a way to express things like "this document's title is worth twice as much as the title of most documents". Query time boosts (query.setBoost(boost)) are a way to express "I care about matches on this clause of my query twice as much as I do about matches on other clauses of my query".
- 
- Index time field boosts (field.setBoost(boost)) are a way to express things like "this document's
- title is worth twice as much as the title of most documents". Query time
- boosts (query.setBoost(boost)) are a way to express "I care about matches on this clause of my query
- twice as much as I do about matches on other clauses of my query".
  
  Index time field boosts are worthless if you set them on every document.
  
  Index time document boosts (doc.setBoost(float)) are equivalent to setting a field boost on every field in that document.
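 
 A minimal sketch of the two kinds of boost, against the Lucene 2.x API (field names, query text and boost values are illustrative):
 
```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class BoostSketch {
    public static void main(String[] args) throws Exception {
        // Index-time boost: this document's title is worth twice as much
        // as a typical title. Set before adding the document to the index.
        Document doc = new Document();
        Field title = new Field("title", "Lucene in Action",
                                Field.Store.YES, Field.Index.TOKENIZED);
        title.setBoost(2.0f);
        doc.add(title);

        // Query-time boost: matches on the title clause count twice as
        // much as matches on other clauses of the query.
        Query titleQuery =
            new QueryParser("title", new StandardAnalyzer()).parse("lucene");
        titleQuery.setBoost(2.0f);
        // combine titleQuery with other clauses via a BooleanQuery as usual
    }
}
```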
  
  === Searching ===
- 
  ==== Does Lucene allow searching and indexing simultaneously? ====
+ Yes.  However, an !IndexReader only searches the index as of the "point in time" that it was opened.  Any updates to the index, either added or deleted documents, will not be visible until the !IndexReader is re-opened. So your application must periodically re-open its !IndexReaders to see the latest updates.  The [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#isCurrent()|IndexReader.isCurrent()]] method allows you to test whether any updates have occurred to the index since your !IndexReader was opened.
- 
- Yes.  However, an !IndexReader only searches the index as of the
- "point in time" that it was opened.  Any updates to the index, either
- added or deleted documents, will not be visible until the !IndexReader is re-opened.
- So your application must periodically re-open its !IndexReaders to see
- the latest updates.  The
- [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#isCurrent()|IndexReader.isCurrent()]] method allows you to test whether any updates
- have occurred to the index since your !IndexReader was opened.
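 
 A re-opening pattern based on isCurrent() might look like this (a sketch against the Lucene 2.x API; the class and method names here are our own):
 
```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class RefreshingSearcher {
    private IndexReader reader;
    private IndexSearcher searcher;

    // Call periodically (or before a search): if the index has changed
    // since the reader was opened, swap in a fresh reader and searcher.
    synchronized void refreshIfStale(String indexDir) throws Exception {
        if (reader == null || !reader.isCurrent()) {
            IndexReader fresh = IndexReader.open(indexDir);
            if (searcher != null) searcher.close();
            if (reader != null) reader.close();
            reader = fresh;
            searcher = new IndexSearcher(reader);
        }
    }
}
```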
  
  ==== Why am I getting no hits / incorrect hits? ====
- 
  Some possible causes:
  
   * The desired term is in a field that was not defined as 'indexed'. Re-index the document and make the field indexed.
@@ -161, +121 @@

   * Use the Query's toString() method to see how it actually got parsed.
   * Use [[http://www.getopt.org/luke/|Luke]] to browse your index: on the "Documents" tab, navigate to a document, then use the "Reconstruct & Edit" to look at how the fields have been stored ("Stored original" tab) and indexed ("Tokenized" tab)
  
- 
  ==== Why am I getting a TooManyClauses exception? ====
+ The following types of queries are expanded by Lucene before it does the search: !RangeQuery, !PrefixQuery, !WildcardQuery, !FuzzyQuery. For example, if the indexed documents contain the terms "car" and "cars", the query "ca*" will be expanded to "car OR cars" before the search takes place. The number of these terms is limited to 1024 by default. Here are a few different approaches that can be used to avoid the !TooManyClauses exception:
- 
- The following types of queries are expanded by Lucene before it does the search: !RangeQuery,
- !PrefixQuery, !WildcardQuery, !FuzzyQuery. For example, if the indexed documents contain the terms "car"
- and "cars" the query "ca*" will be expanded to "car OR cars" before the search takes place. The
- number of these terms is limited to 1024 by default. Here's a few different approaches that can be used to avoid the !TooManyClauses exception:
  
   * Use a filter to replace the part of the query that causes the exception.  For example, a !RangeFilter can replace a !RangeQuery on date fields and it will never throw the !TooManyClauses exception --  You can even use !ConstantScoreRangeQuery to execute your !RangeFilter as a Query.  Note that filters are slower than queries when used for the first time, so you should cache them using [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/CachingWrapperFilter.html|CachingWrapperFilter]].  Using Filters in place of Queries generated by !QueryParser can be achieved by subclassing !QueryParser and overriding the appropriate function to return a !ConstantScore version of your Query.
   * Increase the number of terms using [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)|BooleanQuery.setMaxClauseCount()]]. Note that this will increase the memory requirements for searches that expand to many terms. To deactivate any limits, use !BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE).
   * A specific solution that can work on very precise fields is to reduce the precision of the data in order to reduce the number of terms in the index. For example, the !DateField class uses a microsecond resolution, which is often not required. Instead you can save your dates in the "yyyymmddHHMM" format, maybe even without hours and minutes if you don't need them (this was simplified in Lucene 1.9 thanks to the new !DateTools class).
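
The precision-reduction idea from the last bullet can be sketched with plain java.text (the "yyyymmddHHMM" pattern from the answer, written as yyyyMMddHHmm in SimpleDateFormat syntax):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DatePrecision {
    // Render a date at minute precision, so the index contains at most
    // one term per minute instead of one term per millisecond value.
    static String minutePrecision(Date d) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(d);
    }

    public static void main(String[] args) {
        // Two instants in the same minute collapse to a single term.
        Date a = new Date(0L);        // 1970-01-01T00:00:00 UTC
        Date b = new Date(30000L);    // 30 seconds later
        System.out.println(minutePrecision(a));   // 197001010000
        System.out.println(minutePrecision(b));   // 197001010000
    }
}
```

A range query over such a field then expands to at most a few terms per minute of range, instead of one per distinct timestamp.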
  
- 
- 
  ==== How can I search over multiple fields? ====
- 
  Searching over multiple fields is what people expect, as Google searches all fields by default. You have to parse the query using [[http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/queryParser/MultiFieldQueryParser.html|MultiFieldQueryParser]]. Note that terms which occur in short fields have a higher effect on the result ranking.
  
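 A minimal sketch with MultiFieldQueryParser (Lucene 2.x constructor; the field names are illustrative):
 
```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class MultiFieldSketch {
    public static void main(String[] args) throws Exception {
        // Parse the user's query against both the title and body fields.
        String[] fields = { "title", "body" };
        MultiFieldQueryParser parser =
            new MultiFieldQueryParser(fields, new StandardAnalyzer());
        Query q = parser.parse("apache lucene");
        System.out.println(q);   // inspect how the query was expanded
    }
}
```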
- Alternatively you could create a field which concatenates the content you would like to search and
+ Alternatively you could create a field which concatenates the content you would like to search and search only that field.
- search only that field.
- 
  
  ==== What wildcard search support is available from Lucene? ====
- 
  Lucene supports wild card queries which allow you to perform searches such as ''book*'', which will find documents containing terms such as ''book'', ''bookstore'', ''booklet'', etc. Lucene refers to this type of a query as a 'prefix query'.
  
  Lucene also supports wild card queries which allow you to place a wild card in the middle of the query term. For instance, you could make searches like: ''mi*pelling''. That will match both ''misspelling'', which is the correct way to spell this word, as well as ''mispelling'', which is a common spelling mistake.
@@ -193, +142 @@

  
  Leading wildcards (e.g. ''*ook'') are '''not''' supported by the !QueryParser by default. As of Lucene 2.1, they can be enabled by calling `QueryParser.setAllowLeadingWildcard( true )`. Note that this can be an expensive operation: it requires scanning the list of tokens in the index in its entirety to look for those that match the pattern.
  
- 
  ==== Can I combine wildcard and phrase search, e.g. "foo ba*"? ====
- 
  This is not supported by !QueryParser, but you could extend the !QueryParser to build a [[http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/search/MultiPhraseQuery.html|MultiPhraseQuery]] in those cases.
  
- 
  ==== Is the QueryParser thread-safe? ====
- 
  No, it's not.
  
- 
  ==== How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this? ====
- 
  The [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/QueryFilter.html|QueryFilter class]] is designed precisely for such cases.
  
  Another way of doing it is the following:
@@ -214, +157 @@

  
  If you are restricting access with a prohibited term, and someone tries to require that term, then the prohibited restriction wins. If you are restricting access with a required term, and they try prohibiting that term, then they will get no documents in their search result.
  
+ As for deciding whether to use required or prohibited terms, if possible, you should choose the method that names the less frequent term.  That will make queries faster.
- As for deciding whether to use required or prohibited terms, if possible,
- you should choose the method that names the less frequent term.  That will
- make queries faster.
- 
  
  ==== What is the order of fields returned by Document.fields()? ====
- 
  Fields are returned in the same order they were added to the document.
  
  '''NOTE:''' This functionality was broken in 2.3 and 2.4, but will be fixed in 2.9; see [[https://issues.apache.org/jira/browse/LUCENE-1727|LUCENE-1727]]
  
- 
  ==== How does one determine which documents do not have a certain term? ====
- 
  There is no direct way of doing that.  You could add a term "x" to every document, and then search for "+x -y" to find all of the documents that don't have "y". Note that for large collections this would be slow because of the high term frequency for term "x".
  
  Lucene 1.9 added [[http://svn.apache.org/viewcvs.cgi/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/MatchAllDocsQuery.java|MatchAllDocsQuery]] to make this easier.
  
  ==== How do I get the last document added that has a particular term? ====
- 
  Call:
  
  `TermDocs td = reader.termDocs(term);`
  
  Then step through the returned `TermDocs` with `next()`; the last document it returns is the last one added that contains the term.
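 
 As a sketch (Lucene 2.x API; the helper name is our own, and document numbers track insertion order only as long as the index has not been reordered by deletions and merges):
 
```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class LastDocWithTerm {
    // Returns the number of the last document containing the given term,
    // or -1 if no document contains it.
    static int lastDoc(IndexReader reader, Term term) throws Exception {
        TermDocs td = reader.termDocs(term);
        int last = -1;
        try {
            while (td.next()) {
                last = td.doc();   // doc numbers come back in increasing order
            }
        } finally {
            td.close();
        }
        return last;
    }
}
```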
  
- 
  ==== Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other? ====
- 
  `MultiSearcher` searches indices sequentially. Use [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html|ParallelMultiSearcher]] as a searcher that performs multiple searches in parallel. Please note that there's a [[http://issues.apache.org/bugzilla/show_bug.cgi?id=31841|known bug]] in Lucene < 1.9 in the !MultiSearcher's result ranking.
  
- 
  ==== Is there a way to use a proximity operator (like near or within) with Lucene? ====
- 
  There is a variable called `slop` in `PhraseQuery` that allows you to perform NEAR/WITHIN-like queries.
  
+ By default, `slop` is set to 0 so that only exact phrases will match. However, you can alter the value using the `setSlop(int)` method.
- By default, `slop` is set to 0 so that only exact phrases will match.
- However, you can alter the value using the `setSlop(int)` method.
  
  When using !QueryParser you can use this syntax to specify the slop: "doug cutting"~2 will find documents that contain "doug cutting" as well as ones that contain "cutting doug".
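 
 Programmatically, the same slop setting might look like this (field name and terms are illustrative):
 
```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SlopSketch {
    public static void main(String[] args) {
        // Match "doug cutting" with up to two position moves, so that
        // "cutting doug" also matches -- like a NEAR operator.
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("contents", "doug"));
        query.add(new Term("contents", "cutting"));
        query.setSlop(2);
    }
}
```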
  
- 
  ==== Are Wildcard, Prefix, and Fuzzy queries case sensitive? ====
+ No, not by default. Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the `Analyzer`, which is the component that performs operations such as stemming and lowercasing. The reason for skipping the `Analyzer` is that if you were searching for ''"dogs*"'' you would not want ''"dogs"'' first stemmed to ''"dog"'', since that would then match ''"dog*"'', which is not the intended query. These queries are case-insensitive anyway because `QueryParser` makes them lowercase. This behavior can be changed using the [[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html#setLowercaseExpandedTerms(boolean)|setLowercaseExpandedTerms(boolean)]] method.
- 
- No, not by default. Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the `Analyzer`, which is the component that performs operations such as stemming and lowercasing. The reason for skipping the `Analyzer` is that if you were searching for ''"dogs*"'' you would not want ''"dogs"'' first stemmed to ''"dog"'', since that would then match ''"dog*"'', which is not the
- intended query. These queries are case-insensitive anyway because `QueryParser` makes them lowercase. This behavior can be changed using the [[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html#setLowercaseExpandedTerms(boolean)|setLowercaseExpandedTerms(boolean)]] method.
- 
  
  ==== Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes? ====
- 
  According to the Javadoc for `IndexReader` `maxDoc()` method ''"returns one greater than the largest possible document number".''
  
  In other words, the number returned by `maxDoc()` does not necessarily match the actual number of undeleted documents in the index.
  
  Deleted documents do not get removed from the index immediately, unless you call `optimize()`.
  
- 
  ==== Is there a way to get a text summary of an indexed document with Lucene (a.k.a. a "snippet" or "fragment") to display along with the search result? ====
- 
  You need to store the documents' summary in the index (use Field.Store.YES when creating that field) and then use the Highlighter from the contrib area (distributed with Lucene since version 1.9 as "lucene-highlighter-(version).jar"). It's important to use a rewritten query as the input for the highlighter, i.e. call rewrite() on the query. Otherwise simple queries will work but prefix queries etc. will not be highlighted.
  
  For Lucene < 1.9, you can also get the "highlighter-dev.jar" from http://www.lucenebook.com/LuceneInAction.zip. See http://www.gossamer-threads.com/lists/lucene/java-user/31595 for a discussion of this.
  
  ==== Can I search an index while it is being optimized? ====
- 
  Yes, an index can be searched and optimized simultaneously.
  
- 
  ==== Can I cache search results with Lucene? ====
- 
- Lucene does come with a simple cache mechanism, if you use [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Filter.html|Lucene Filters]] .
- The classes to look at are [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/CachingWrapperFilter.html|CachingWrapperFilter]] and [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/QueryFilter.html|QueryFilter]].
+ Lucene does come with a simple caching mechanism, if you use [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Filter.html|Lucene Filters]]. The classes to look at are [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/CachingWrapperFilter.html|CachingWrapperFilter]] and [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/QueryFilter.html|QueryFilter]].
  
  Also consider using a JSP tag for caching, see http://www.opensymphony.com/oscache/ for one tag library that's easy and works well.
  
- 
  ==== Is the IndexSearcher thread-safe? ====
- 
  Yes, !IndexSearcher is thread-safe.  Multiple search threads may use the same instance of !IndexSearcher concurrently without any problems. It is recommended to use only one !IndexSearcher from all threads in order to save memory.
  
- 
  ==== Is there a way to retrieve the original term positions during the search? ====
- 
  Yes, see the Javadoc for [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#termPositions()|IndexReader.termPositions()]].
  
- 
  ==== How do I retrieve all the values of a particular field that exists within an index, across all documents? ====
+ The trick is to enumerate terms with that field.  Terms are sorted first by field, then by text, so all terms with a given field are adjacent in enumerations.  Term enumeration is also efficient.
- 
- The trick is to enumerate terms with that field.  Terms are sorted first
- by field, then by text, so all terms with a given field are adjacent in
- enumerations.  Term enumeration is also efficient.
  
  {{{
  try
@@ -323, +235 @@

      terms.close();
  }
  }}}
- 
- 
  ==== Can Lucene do a "search within search", so that the second search is constrained by the results of the first query? ====
- 
  Yes.  There are two primary options:
  
-  * Use `QueryFilter` with the previous query as the filter. (you can search the mailing list archives for `QueryFilter` and Doug Cutting's recommendations against using it for this purpose)
+  * Use `QueryFilter` with the previous query as the filter. Doug Cutting [[http://mail-archives.apache.org/mod_mbox/lucene-dev/200208.mbox/%3C3D4EC292.8040203@lucene.com%3E|recommends against this]], because a QueryFilter does not affect ranking.
   * Combine the previous query with the current query using `BooleanQuery`, using the previous query as required.
  
  The `BooleanQuery` approach is the recommended one.
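 
 A sketch of the recommended BooleanQuery approach (Lucene 2.x API; the helper name is our own):
 
```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class SearchWithinSearch {
    // Constrain the refinement query by the previous query:
    // a document must match both to appear in the results.
    static Query refine(Query previous, Query refinement) {
        BooleanQuery combined = new BooleanQuery();
        combined.add(previous, BooleanClause.Occur.MUST);
        combined.add(refinement, BooleanClause.Occur.MUST);
        return combined;
    }
}
```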
  
- 
  ==== Does the position of the matches in the text affect the scoring? ====
- 
  No, the position of matches within a field does not affect ranking.
  
- 
  ==== How do I make sure that a match in a document title has greater weight than a match in a document body? ====
- 
  If you put the title in a separate field from the body, and search both fields, matches in the title will usually be stronger without explicit boosting. This is because the scores are normalized by the length of the field, and the title tends to be much shorter than the body.  Therefore, even without boosting, title matches usually come before body matches. But you can also boost queries on title by using query.setBoost(boost) on the relevant clause.
  
- 
  ==== How do I find similar documents? ====
- 
  See the !MoreLikeThis class in the org.apache.lucene.search.similar package.   In Lucene 1.9 it was in the "similarity" contrib jar, but starting with Lucene 2.1 it was moved to the new "queries" contrib.
  
  ==== Can I filter by score? ====
- 
  Not safely.  You can always pick an arbitrary score value and then check the Hits object to see how many results have a score higher than that value (a binary search might come in handy), but it really doesn't give you any meaningful information because of the way the score is calculated...
  
  [[http://article.gmane.org/gmane.comp.jakarta.lucene.user/12076|One Explanation...]]
+ 
  {{{
    > Does anyone have an example of limiting results returned based on a
    > score threshold? For example if I'm only interested in documents with
@@ -366, +269 @@

  returned, at least at present, so there is not a way to determine from
  the scores what the quality of the result set is overall.
  }}}
- 
  For more detailed discussion, please read ScoresAsPercentages
  
  ==== How can I cluster results, i.e. create groups of similar documents? ====
- 
  Check out [[http://www.carrot2.org|Carrot]], a clustering framework that can be used with Lucene.
  
  ==== How do I implement paging, i.e. showing result from 1-10, 11-20 etc? ====
- 
  Just re-execute the search and ignore the hits you don't want to show. Since people usually look only at the first few results, this approach is usually fast enough.
  
  ==== How do I speed up searching? ====
- 
  See ImproveSearchingSpeed.
  
  === Indexing ===
- 
  ==== Can I use Lucene to crawl my site or other sites on the Internet? ====
- 
  No. Lucene does not know how to access external documents, nor does it know how to extract the content and links of HTML and other document formats. Lucene focuses on indexing and searching and does both very well. However, several crawlers are available which you could use: [[http://java-source.net/open-source/crawlers|list of Open Source Crawlers in Java]]. [[http://regain.sourceforge.net|regain]] is an Open Source tool that crawls web sites, stores them in a Lucene index and offers a search web interface. Also see [[http://lucene.apache.org/nutch|Nutch]] for a powerful Lucene-based search engine.
  
  ==== How can I use Lucene to index a database? ====
- 
  Connect to the database using JDBC and use an SQL "SELECT" statement to query the database. Then create one Lucene Document object per row and add it to the index. You will probably want to store the ID column so you can later access the matching items. For other (text) columns it might make more sense to only index (not store) them, as the original data is still available in your database.
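 
 A sketch of the row-to-Document mapping (the table and column names are hypothetical):
 
```java
import java.sql.ResultSet;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class RowToDocument {
    // Map one row of a hypothetical "articles" table to a Lucene Document:
    // store the primary key so search results can be joined back to the
    // database, but only index (don't store) the bulky text column.
    static Document toDocument(ResultSet rs) throws Exception {
        Document doc = new Document();
        doc.add(new Field("id", rs.getString("id"),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("text", rs.getString("text"),
                          Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}
```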
  
  For a more high level approach you might want to have a look at [[http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql|LuSql]] (a specialized tool for moving data from JDBC-accessible databases into Lucene), [[http://search.hibernate.org|Hibernate Search]], [[http://www.compass-project.org/|Compass]], [[http://www.dbsight.net/|DBSight]], or [[http://wiki.apache.org/solr/DataImportHandler|Solr's Data Import Handler]] which all use Lucene internally.
  
  ==== How do I perform a simple indexing of a set of documents? ====
- 
  The easiest way is to re-index the entire document set periodically or whenever it changes. All you need to do is create an !IndexWriter, iterate over your document set, create a Lucene Document object for each document, and add it to the !IndexWriter. When you are done, make sure to close the !IndexWriter. This will release all of its resources and close the files it created.
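 
 The whole cycle might look like this (a sketch against the Lucene 2.x API; the index path, field name and sample texts are illustrative):
 
```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // 'true' re-creates the index from scratch.
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        try {
            for (String text : new String[] { "first doc", "second doc" }) {
                Document doc = new Document();
                doc.add(new Field("contents", text,
                                  Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
        } finally {
            writer.close();   // releases resources and the write lock
        }
    }
}
```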
  
- 
  ==== How can I add document(s) to the index? ====
- 
  Simply create an !IndexWriter and use its addDocument() method. Make sure to create the !IndexWriter with the 'create' flag set to false and make sure to close the !IndexWriter when you are done adding the documents.
  
- 
  ==== Where does Lucene store the index it builds? ====
- 
  Typically, the index is stored in a set of files that Lucene creates in a directory of your choice. If your system uses multiple independent indices, simply create a separate directory for each index.
  
  Lucene's API also provides a way to use or implement other storage methods such as in-memory storage (RAMDirectory), or a mapping of Lucene data to any third party database (not included in Lucene).
  
- 
  ==== Can I store the Lucene index in a relational database? ====
- 
  Lucene does not support that functionality out of the box, but several people have implemented [[http://www.google.com/search?q=jdbcdirectory%20lucene|JdbcDirectory's]].  The reports we have seen so far indicate that performance with such implementations is not great, but it is doable.
  
- 
  ==== Can I store the Lucene index in a BerkeleyDB? ====
- 
  Yes, you can use BerkeleyDB as the Lucene index store.  Just use the !DbDirectory implementation from Lucene's contrib section.
  
- 
  ==== I get "No tvx file". What does that mean? ====
- 
  It's a "warning" that can safely be ignored. It has been fixed (i.e. the warning has been removed) in Lucene 1.9.
  
  ==== Does Lucene store a full copy of the indexed documents? ====
- 
  It is up to you. You can tell Lucene what document information to use just for indexing and what document information to also store in the index (with or without indexing).
  
  ==== What is the difference between Stored, Tokenized, Indexed, and Vector? ====
- 
   * Stored = as-is value stored in the Lucene index
   * Tokenized = field is analyzed using the specified Analyzer - the tokens emitted are indexed
   * Indexed = the text (either as-is with keyword fields, or the tokens from tokenized fields) is made searchable (aka inverted)
   * Vectored = term frequency per document is stored in the index in an easily retrievable fashion.
  
- 
  ==== What happens when you IndexWriter.add() a document that is already in the index?  Does it overwrite the previous document? ====
- 
  No, there will be multiple copies of the same document in the index.
  
- 
  ==== How do I delete documents from the index? ====
+ `IndexWriter` allows you to delete by `Term` or by `Query`.  The deletes are buffered and then periodically flushed to the index, and made visible once `commit()` or `close()` is called.
  
+ `IndexReader` can also delete documents, by `Term` or document number, but you must close any open `IndexWriter` before using `IndexReader` to make changes (and vice versa).  `IndexReader` also buffers the deletions and does not write changes to the index until `close()` is called, but if you use that same `IndexReader` for searching, the buffered deletions will immediately take effect.  Unlike `IndexWriter`'s delete methods, `IndexReader`'s methods return the number of documents that were deleted.
- `IndexWriter` allows you to delete by `Term` or by `Query`.  The
- deletes are buffered and then periodically flushed to the index, and
- made visible once `commit()` or `close()` is called.
  
+ Generally it's best to use `IndexWriter` for deletions, unless 1) you must delete by document number, 2) you need your searches to immediately reflect the deletions or 3) you must know how many documents were deleted for a given deleteDocuments invocation.
- `IndexReader` can also delete documents, by `Term` or document number,
- but you must close any open `IndexWriter` before using `IndexReader`
- to make changes (and, vice/versa).  `IndexReader` also buffers the
- deletions and does not write changes to the index until `close()` is
- called, but if you use that same `IndexReader` for searching, the
- buffered deletions will immediately take effect.  Unlike
- `IndexWriter`'s delete methods, `IndexReader`'s methods return the
- number of documents that were deleted.
  
+ If you must delete by document number but would otherwise like to use `IndexWriter`, one common approach is to make a primary key field that holds a unique ID string for each document.  Then you can delete a single document by creating the `Term` containing the ID, and passing that to `IndexWriter`'s `deleteDocuments(Term)` method.
- Generally it's best to use `IndexWriter` for deletions, unless 1) you
- must delete by document number, 2) you need your searches to
- immediately reflect the deletions or 3) you must know how many
- documents were deleted for a given deleteDocuments invocation.
  
+ Once a document is deleted it will not appear in `TermDocs` or `TermPositions` enumerations, nor in any search results.  Attempts to load the document will result in an exception.  The presence of this document may still be reflected in the `docFreq` statistics, and thus alter search scores, though this will be corrected eventually as segments containing deletions are merged.
- If you must delete by document number but would otherwise like to use
- `IndexWriter`, one common approach is to make a primary key field,
- that holds a unique ID string for each document.  Then you can delete
- a single document by creating the `Term` containing the ID, and
- passing that to `IndexWriter`'s `deleteDocuments(Term)` method.
- 
- Once a document is deleted it will not appear in `TermDocs` nor
- `TermPositions` enumerations, nor any search results.  Attempts to
- load the document will result in an exception.  The presence of this
- document may still be reflected in the `docFreq` statistics, and thus
- alter search scores, though this will be corrected eventually as
- segments containing deletions are merged.
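The primary-key pattern described above might look like the following sketch (Lucene 2.1+ `deleteDocuments(Term)`; the index path, the `id` field, and its value are all illustrative):

```java
// Assumes each document was indexed with an un-tokenized "id" field.
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.deleteDocuments(new Term("id", "doc-42")); // deletion is buffered
writer.close(); // flushes the deletion and makes it visible to newly opened readers
```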
- 
  
  ==== Is there a way to limit the size of an index? ====
- 
  This question is sometimes brought up because of the 2GB file size limit of some 32-bit operating systems.
  
  This is a slightly modified answer from Doug Cutting:
@@ -495, +351 @@

  
  Write a version of `FSDirectory` that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files.
  
- 
  ==== Why is it important to use the same analyzer type during indexing and search? ====
- 
  The analyzer controls how the text is broken into terms which are then used to index the document. If you are using an analyzer of one type to index and an analyzer of a different type to parse the search query, it is possible that the same word will be mapped to two different terms and this will result in missing or false hits.
  
  '''NOTE:''' It's not a rule that the same analyzer be used for both indexing and searching, and there are cases where it makes sense to use different ones (ie: when dealing with synonyms).  The analyzers must be compatible though.
@@ -505, +359 @@

  Also be careful with Fields that are not tokenized (like Keywords). During indexation, the Analyzer won't be called for these fields, but for a search, the !QueryParser can't know this and will pass all search strings through the selected Analyzer.  Usually searches for Keywords are constructed in code, but during development it can be handy to use general purpose tools (e.g. Luke) to examine your index.  Those tools won't know which fields are tokenized either.  In the contrib/analyzers area there's a !KeywordTokenizer with an example !KeywordAnalyzer for cases like this.
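The mismatch described above is not Lucene-specific; it can be simulated with plain Java using two hypothetical "analyzers", where only the index-time one lower-cases:

```java
import java.util.Arrays;
import java.util.List;

public class AnalyzerMismatch {
    // Index-time "analyzer": splits on whitespace and lower-cases.
    static List<String> indexAnalyzer(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    // Query-time "analyzer" that forgets to lower-case.
    static List<String> queryAnalyzer(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> indexed = indexAnalyzer("Apache Lucene");
        // Same analyzer on both sides: the terms line up and the search matches.
        System.out.println(indexed.contains(indexAnalyzer("Lucene").get(0))); // true
        // Different analyzers: "Lucene" != "lucene", so the hit is missed.
        System.out.println(indexed.contains(queryAnalyzer("Lucene").get(0))); // false
    }
}
```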
  
  ==== What is index optimization and when should I use it? ====
- 
  The !IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental updates add documents frequently, you may want to perform the optimization only once in a while to avoid the extra overhead of the optimization.
  
  ==== What are Segments? ====
- 
  The index database is composed of 'segments' each stored in a separate file. When you add documents to the index, new segments may be created. You can compact the database and reduce the number of segments by optimizing it (see a separate question regarding index optimization).
  
- 
  ==== Is Lucene index database platform independent? ====
- 
  Yes, you can copy a Lucene index directory from one platform to another and it will work just as well.
  
- 
  ==== When I recreate an index from scratch, do I have to delete the old index files? ====
- 
  No, creating the !IndexWriter with "true" should remove all old files in the old index (note that with Lucene < 1.9 it removes '''all''' files in the index directory, even those that do not belong to Lucene).
  
- 
  ==== How can I index and search digits and other non-alphabetic characters? ====
- 
  The components responsible for this are various `Analyzers`. Make sure you use the appropriate analyzer. For example, !StandardAnalyzer does not remove numbers, but it removes most punctuation.
  
- 
  ==== Is the IndexWriter class, and especially the method addIndexes(Directory[]), thread safe? ====
- 
  Yes, the `IndexWriter.addIndexes(Directory[])` method is thread safe (it is a `synchronized` method). !IndexWriter in general is thread safe, i.e. you should use the same !IndexWriter object from all of your threads. In fact it is impossible to use more than one !IndexWriter for the same index directory, as that leads to an exception when trying to create the lock file.
  
- 
  ==== When is it possible for document IDs to change? ====
- 
  Documents are only re-numbered after there have been deletions.  Once there have been deletions, renumbering may be triggered by any document addition or index optimization.  Once an index is optimized, no renumbering will be performed until more deletions are made.
  
  If you require a persistent document id that survives deletions, then add it as a field to your documents.
  
- 
  ==== What is the purpose of write.lock file, when is it used, and by which classes? ====
- 
- The write.lock is used to keep processes from concurrently attempting
+ The write.lock is used to keep processes from concurrently attempting to modify an index.
- to modify an index.
  
  It is obtained by an !IndexWriter while it is open, and by an !IndexReader once documents have been deleted and until it is closed.
  
- 
  ==== What is the purpose of the commit.lock file, when is it used, and by which classes? ====
+ The commit.lock file is used to coordinate the contents of the 'segments' file with the files in the index.  It is obtained by an !IndexReader before it reads the 'segments' file, which names all of the other files in the index, and until the !IndexReader has opened all of these other files.
- 
- The commit.lock file is used to coordinate the contents of the 'segments'
- file with the files in the index.  It is obtained by an !IndexReader before it reads the 'segments' file, which names all of the other files in the
- index, and until the !IndexReader has opened all of these other files.
  
  The commit.lock is also obtained by the !IndexWriter when it is about to write the segments file and until it has finished trying to delete obsolete index files.
  
+ The commit.lock should thus never be held for long, since while it is obtained files are only opened or deleted, and one small file is read or written.
- The commit.lock should thus never be held for long, since while
- it is obtained files are only opened or deleted, and one small file is
- read or written.
  
+ Note that as of Lucene 2.1, the commit.lock is no longer used.  Instead, to prevent contention on the segments file, Lucene writes to segments-N files where each commit increments the N.  The write.lock is still used. See [[http://issues.apache.org/jira/browse/LUCENE-701|LUCENE-701]] for details.
- Note that as of Lucene 2.1, the commit.lock is no longer used.  Instead,
- to prevent contention on the segments file, Lucene writes to segments-N
- files where each commit increments the N.  The write.lock is still used.
- See [[http://issues.apache.org/jira/browse/LUCENE-701|LUCENE-701]] for
- details.
  
  ==== My program crashed and now I get a "Lock obtain timed out." error. Where is the lock and how can I delete it? ====
- 
  When using FSDirectory, lock files are kept in the directory specified by the "org.apache.lucene.lockdir" system property if it is set, or by default in the directory specified by the "java.io.tmpdir" system property (on Unix boxes this is usually "/var/tmp" or "/tmp").
  
  If for some strange reason "java.io.tmpdir" is not set, then the directory path you specified to create your index is used.
@@ -576, +404 @@

  
  If you are certain that a lock file is not in use, you can delete it manually.  You should also look at the methods "[[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#isLocked(org.apache.lucene.store.Directory)|IndexReader.isLocked]]" and "[[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#unlock(org.apache.lucene.store.Directory)|IndexReader.unlock]]" if you are interested in writing recovery code that can remove locks automatically.
  
- 
  ==== Is there a maximum number of segment infos whose summary (name and document count) is stored in the segments file? ====
- 
  All segments in the index are listed in the segments file.  There is no hard limit. For an un-optimized index it is proportional to the log of the number of documents in the index. An optimized index contains a single segment.
  
- 
  ==== What happens when I open an IndexWriter, optimize the index, and then close the IndexWriter?  Which files will be added or modified? ====
+ All of the segments are merged into a single new segment file. If the index was empty to begin with, no segments will be created, only the `segments` file.
- 
- All of the segments are merged into a single new segment file.
- If the index was empty to begin with, no segments will be created, only the `segments` file.
- 
  
  ==== If I decide not to optimize the index, when will the deleted documents actually get deleted? ====
- 
  Documents that are deleted are marked as deleted.  However, the space they consume in the index does not get reclaimed until the index is optimized.  That space will also eventually be reclaimed as more documents are added to the index, even if the index does not get optimized.
  
- 
  ==== How do I update a document or a set of documents that are already indexed? ====
- 
  There is no direct update procedure in Lucene. To update an index incrementally you must first '''delete''' the documents that were updated, and '''then re-add''' them to the index.
  
- 
  ==== How do I write my own Analyzer? ====
- 
  Here is an example:
  
  {{{
@@ -623, +440 @@

      }
  }
  }}}
- 
  All that being said, most of the heavy lifting in custom analyzers is done by calls to custom subclasses of TokenFilter.
  
  If you want your custom token modification to come after the filters that Lucene's StandardAnalyzer class would normally call, do the following:
+ 
  {{{
  return new NameFilter(
- 	CaseNumberFilter(
+         CaseNumberFilter(
- 		new StopFilter(
- 			new LowerCaseFilter(
- 				new StandardFilter(
- 					new StandardTokenizer(reader)
- 			)
+                 new StopFilter(
+                         new LowerCaseFilter(
+                                 new StandardFilter(
+                                         new StandardTokenizer(reader)
+                         )
- 		), StopAnalyzer.ENGLISH_STOP_WORDS)
+                 ), StopAnalyzer.ENGLISH_STOP_WORDS)
- 	)
+         )
  );
  }}}
- 
  ==== How do I index non Latin characters? ====
- 
  Lucene only uses Java strings, so you normally do not need to care about this. Just remember that you may need to specify an encoding when you read in external strings from e.g. a file (otherwise the system's default encoding will be used). If you really need to recode a String you can use this hack:
  
  {{{
  String newStr = new String(someString.getBytes("UTF-8"));
  }}}
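Rather than recoding after the fact, it is usually safer to declare the encoding at the point where the bytes are first turned into a String. A small self-contained sketch (the byte array here stands in for a UTF-8 file on disk):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class ReadUtf8 {
    public static void main(String[] args) throws Exception {
        byte[] fileBytes = "über café".getBytes("UTF-8"); // stand-in for a UTF-8 file
        // Name the charset explicitly; otherwise the platform default encoding is used.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), "UTF-8"));
        System.out.println(reader.readLine().equals("über café")); // true
    }
}
```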
- 
- 
  ==== How can I index HTML documents? ====
- 
  In order to index HTML documents you need to first parse them to extract text that you want to index from them.  Have a look at [[http://lucene.apache.org/tika/|Tika, the content analysis toolkit]].
  
  Alternately...
@@ -665, +477 @@

  
  [[http://jerichohtml.sourceforge.net/|Jericho HTML Parser]] provides a simple [[http://jericho.htmlparser.net/docs/javadoc/index.html?net/htmlparser/jericho/TextExtractor.html|TextExtractor]] class that converts any segment of an HTML document into a string of space-separated words, optionally including the values from title, alt, label, and summary attributes.  The parser is also very tolerant of badly formatted HTML and can also handle server-based source tags such as JSP, ASP, PHP etc.
  
- 
  ==== How can I index XML documents? ====
- 
  In order to index XML documents you need to first parse them to extract text that you want to index from them.  Have a look at [[http://lucene.apache.org/tika/|Tika, the content analysis toolkit]].
  
  See also this article [[http://www-106.ibm.com/developerworks/library/j-lucene/|Parsing, indexing, and searching XML with Digester and Lucene]].
  
- 
  ==== How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc? ====
- 
  Have a look at [[http://lucene.apache.org/tika/|Tika, the content analysis toolkit]].
  
  Alternately: Many modern office file formats (.odt, .sxw, .sxc, etc) are ZIP archives that contain XML files. You can uncompress the file using Java's ZIP support, then parse e.g. meta.xml to get the title and e.g. content.xml to get the document's content. You can then add these to the Lucene index, typically using one Lucene field per property.
@@ -683, +491 @@

  
  For MS-Word, MS-Excel, MS-Visio, and MS-Powerpoint you might also want to take a look at [[http://poi.apache.org|Apache POI]].
  
- Lucene In Action contains an example of how to extract text from RTF files using the  [[http://mail-archives.apache.org/mod_mbox/lucene-java-user/200504.mbox/%3c8c7324c56edcc88bf9c4e58495409b29@ehatchersolutions.com%3e|Swing RTFEditorKit class]].
+ Lucene In Action contains an example of how to extract text from RTF files using the  [[http://mail-archives.apache.org/mod_mbox/lucene-java-user/200504.mbox/<8c7324c56edcc88bf9c4e58495409b29@ehatchersolutions.com>|Swing RTFEditorKit class]].
  
  ==== How can I index Email (from MS-Exchange or another IMAP server) ? ====
- 
  Take a look at:
+ 
   * http://www.chencer.com/techno/java/lucene/imap.html
   * http://zoe.sourceforge.net/
  
  ==== How can I index PDF documents? ====
- 
  In order to index PDF documents you need to first parse them to extract text that you want to index from them.  Here are some PDF parsers that can help you with that:
  
  [[http://pdfbox.org/|PDFBox]] is a Java API from Ben Litchfield that will let you access the contents of a PDF document. It comes with integration classes for Lucene to translate a PDF into a Lucene document.
@@ -705, +512 @@

  
  Based on xpdf, there is a utility called [[http://pdftohtml.sourceforge.net/|pdftohtml]] that can translate PDF files into HTML files. This is also not a Java application.
  
- 
  ==== How can I index JSP files? ====
- 
  To index the content of JSPs that a user would see using a Web browser, you would need to write an application that acts as a Web client, in order to mimic the Web browser behaviour (i.e. a web crawler).  Once you have such an application, you should be able to point it to the desired JSP, retrieve the contents that the JSP generates, parse it, and feed it to Lucene. See [[http://java-source.net/open-source/crawlers|list of Open Source Crawlers in Java]].
  
  How to parse the output of the JSP depends on the type of content that the JSP generates.  In most cases the content is going to be in HTML format.
@@ -715, +520 @@

  Most importantly, do not try to index JSPs by treating them as normal files in your file system.  In order to index JSPs properly you need to access them via HTTP, acting like a Web client.
  
  ==== How can I index java source files? ====
- 
  There is an [[http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html|article at onjava.com]] that describes how to index Java source files not just as plain text, but distinguishing between the different kinds of information, like superclass, implemented interfaces, methods, imported classes, etc.
  
  Note that the article uses an older version of Apache Lucene. For parsing the Java source files and extracting that information, the [[http://help.eclipse.org/help33/topic/org.eclipse.jdt.doc.isv/reference/api/org/eclipse/jdt/core/dom/ASTParser.html|ASTParser]] of the [[http://www.eclipse.org/jdt/|Eclipse Java development tools]] is used.
  
- 
  ==== If I use a compound file-style index, can I still optimize my index? ====
- 
  Yes.  Each .cfs file created in the compound file-style index represents a single segment, which means you can still merge multiple segments into a single segment by optimizing the index.
  
- 
  ==== What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments? ====
- 
  When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the !IndexReader-based method requires that all indexes be open when passed.
  
  The primary advantage of the !IndexReader-based method is that one can pass it !IndexReaders that don't reside in a Directory.
  
- 
  ==== Can I use Lucene to index text in Chinese, Japanese, Korean, and other multi-byte character sets? ====
- 
  Yes, you can.  Lucene is not limited to English, nor any other language.  To index text properly, you need to use an Analyzer appropriate for the language of the text you are indexing.  Lucene's default Analyzers work well for English.  There are a number of other Analyzers in [[http://lucene.apache.org/java/docs/lucene-sandbox/|Lucene Sandbox]], including those for Chinese, Japanese, and Korean.
  
  ==== Why do I have a deletable file (and old segment files remain) after running optimize? ====
  This is normal behavior on Windows whenever you also have readers (IndexReaders or IndexSearchers) open against the index you are optimizing.  Lucene tries to remove old segments files once they have been merged (optimized).  However, because Windows does not allow removing files that are open for reading, Lucene catches an IOException deleting these files and then records these pending deletable files into the "deletable" file.  On the next segments merge, which happens with explicit optimize() or close() calls and also whenever the IndexWriter flushes its internal RAMDirectory to disk (every IndexWriter.DEFAULT_MAX_BUFFERED_DOCS (default 10) addDocuments), Lucene will try again to delete these files (and additional ones) and any that still fail will be rewritten to the deletable file.
  
+ Note that as of 2.1 the deletable file is no longer used.  Instead, Lucene computes which files are no longer referenced by the index and removes them whenever a writer is created.
- This is normal behavior on Windows whenever you also have readers
- (IndexReaders or IndexSearchers) open against the index you are
- optimizing.  Lucene tries to remove old segments files once they have
- been merged (optimized).  However, because Windows does not allow
- removing files that are open for reading, Lucene catches an
- IOException deleting these files and and then records these pending
- deletable files into the "deletable" file.  On the next segments
- merge, which happens with explicit optimize() or close() calls and
- also whenever the IndexWriter flushes its internal RAMDirectory to
- disk (every IndexWriter.DEFAULT_MAX_BUFFERED_DOCS (default 10)
- addDocuments), Lucene will try again to delete these files (and
- additional ones) and any that still fail will be rewritten to the
- deletable file.
- 
- Note that as of 2.1 the deletable file is no longer used.  Instead,
- Lucene computes which files are no longer referenced by the index
- and removes them whenever a writer is created.
  
  ==== How do I speed up indexing? ====
- 
  See ImproveIndexingSpeed.
  
