From: lucene-cvs@jakarta.apache.org
To: lucene-cvs@jakarta.apache.org
Subject: [Jakarta Lucene Wiki] Updated: LuceneFAQ
Date: Wed, 22 Dec 2004 23:19:31 -0000
Message-ID: <20041222231931.62512.72952@minotaur.apache.org>
List-Id: "Lucene Developers List"

Date: 2004-12-22T15:19:31
Editor: DanielNaber
Wiki: Jakarta Lucene Wiki
Page: LuceneFAQ
URL:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ

   new FAQ -- still a bit work in progress

Change Log:

------------------------------------------------------------------------------
@@ -1,6 +1,495 @@
-= FAQs =
+This FAQ is currently being worked on (2004-12-22), the update should be done in a few days.
 
- There are two official FAQs for Lucene:
+Note that the [http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi old FAQ] isn't maintained anymore.
+
+[[TableOfContents]]
+
+== FAQ ==
+
+=== General ===
+
+==== What is the URL of Lucene's home page? ====
+
+Lucene's home is at The Jakarta Project: http://jakarta.apache.org/lucene/.
+
+
+==== Are there any mailing lists available? ====
+
+There's a user list and a developer list, both available at http://jakarta.apache.org/site/mail2.html#Lucene
+
+
+==== What Java version is required to run Lucene? ====
+
+Lucene will run with JDK 1.1.8 and up. (???)
+
+
+==== Will Lucene work with my Java application? ====
+
+Yes, Lucene has no external dependencies.
+
+
+==== Where can I get the javadocs for the org.apache.lucene classes? ====
+
+The docs for all the classes are available online at http://jakarta.apache.org/lucene/docs/api/. In addition, they are a part of the standard distribution, and you can always recreate them by running `ant javadocs`.
+
+
+==== Why can't I use Lucene with IBM JDK 1.3.1? ====
+
+Apparently there is a bug in IBM's JIT code in JDK 1.3.1.
+To work around it, disable JIT for the `org.apache.lucene.store.OutputStream.writeInt` method by setting the following environment variable:
+
+`JITC_COMPILEOPT=SKIP{org/apache/lucene/store/OutputStream}{writeInt}`
+
+
+==== Where does the name Lucene come from? ====
+
+Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.
+
+
+==== Are there any alternatives to Lucene? ====
+
+Besides commercial products which we don't know much about there's also [http://www.egothor.org Egothor].
+
+
+==== Does Lucene have a web crawler? ====
+
+No, check out the [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java].
+
+
+
+=== Searching ===
+
+==== What wildcard search support is available from Lucene? ====
+
+Lucene supports wild card queries which allow you to perform searches such as ''book*'', which will find documents containing terms such as ''book'', ''bookstore'', ''booklet'', etc. Lucene refers to this type of a query as a 'prefix query'.
+
+Lucene also supports wild card queries which allow you to place a wild card in the middle of the query term. For instance, you could make searches like: ''mi*pelling''. That will match both ''misspelling'', which is the correct way to spell this word, as well as ''mispelling'', which is a common spelling mistake.
+
+Another wild card character that you can use is '?', a question mark. The ? will match a single character. This allows you to perform queries such as ''Bra?il''. Such a query will match both ''Brasil'' and ''Brazil''. Lucene refers to this type of a query as a 'wildcard query'.
+
+'''Note''': Leading wildcards (e.g. ''*ook'') are '''not''' supported by the QueryParser.
+
+
+==== Is the QueryParser thread-safe? ====
+
+Yes, `QueryParser` is thread-safe. Its static `parse` method creates a new instance of `QueryParser` each time it is called. (??? so is it thread safe only for the static method?)
+
+
+==== How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this? ====
+
+The [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/QueryFilter.html QueryFilter] class is designed precisely for such cases.
+
+Another way of doing it is the following:
+
+Just before calling `IndexSearcher.search()` add a clause to the query to exclude documents in categories not permitted for this search.
+
+If you are restricting access with a prohibited term, and someone tries to require that term, then the prohibited restriction wins. If you are restricting access with a required term, and they try prohibiting that term, then they will get no documents in their search result.
+
+As for deciding whether to use required or prohibited terms, if possible,
+you should choose the method that names the less frequent term. That will
+make queries faster.
+
+
+==== What is the order of fields returned by Document.fields()? ====
+
+Fields are returned in the same order they were added to the document.
+
+
+==== How does one determine which documents do not have a certain term? ====
+
+There is no direct way of doing that. You could add a term "x" to every document, and then search for "+x -y" to find all of the documents that don't have "y". Note that for large collections this would be slow because of the high term frequency for term "x".
+
+
+==== How do I get the last document added that has a particular term? ====
+
+Call:
+
+`TermDocs td = IndexReader.termDocs(Term);`
+
+Then grab the last document in the `TermDocs` that this method returns.
+
+
+==== Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other? ====
+
+`MultiSearcher` searches indices sequentially. Use `ParallelMultiSearcher` as a searcher that performs multiple searches in parallel.
+
+
+==== Is there a way to use a proximity operator (like near or within) with Lucene? ====
+
+There is a variable called `slop` in `PhraseQuery` that allows you to perform NEAR/WITHIN-like queries.
+
+By default, `slop` is set to 0 so that only exact phrases will match.
+However, you can alter the value using the `setSlop(int)` method.
+
+When using QueryParser you can use this syntax to specify the slop: "doug cutting"~2 will find documents that contain "doug cutting" as well as ones that contain "cutting doug".
+
+
+==== Are Wildcard, Prefix, and Fuzzy queries case sensitive? ====
+
+No, but unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the `Analyzer`, which is the component that performs operations such as stemming.
+
+The reason for skipping the `Analyzer` is that if you were searching for ''"dogs*"'' you would not want ''"dogs"'' first stemmed to ''"dog"'', since that would then match ''"dog*"'', which is not the
+intended query.
+
+
+==== Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes? ====
+
+According to the Javadoc for `IndexReader`, the `maxDoc()` method ''"returns one greater than the largest possible document number".''
+
+In other words, the number returned by `maxDoc()` does not necessarily match the actual number of undeleted documents in the index.
+
+Deleted documents do not get removed from the index immediately, unless you call `optimize()`.
+
+
+==== Is there a way to get a text summary of an indexed document with Lucene? ====
+
+You could store the document's summary in the index and then use the Highlighter from the sandbox.
+
+
+==== Can I search an index while it is being optimized? ====
+
+Yes, an index can be searched and optimized simultaneously.
+
+
+==== Can I cache search results with Lucene? ====
+
+Lucene does come with a simple cache mechanism, if you use [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Filter.html Lucene Filters].
+The classes to look at are [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/CachingWrapperFilter.html CachingWrapperFilter] and [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/QueryFilter.html QueryFilter].
+
+
+==== Is the IndexSearcher thread-safe? ====
+
+'''Yes''', IndexSearcher is thread-safe. Multiple search threads may access the index concurrently without any problems.
+
+
+==== Is there a way to retrieve the original term positions during the search? ====
+
+Yes, see the Javadoc for `IndexReader.termPositions()`.
+
+
+==== How do I retrieve all the values of a particular field that exists within an index, across all documents? ====
+
+The trick is to enumerate terms with that field. Terms are sorted first
+by field, then by text, so all terms with a given field are adjacent in
+enumerations. Term enumeration is also efficient.
+
+{{{
+// Position the enumeration at the first term of the field.
+TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
+try
+{
+    // Stop at the end of the enumeration or at the first term of another field.
+    while (terms.term() != null && "FIELD-NAME-HERE".equals(terms.term().field()))
+    {
+        // ... collect terms.term().text() ...
+
+        if (!terms.next())
+            break;
+    }
+}
+finally
+{
+    terms.close();
+}
+}}}
+
+
+==== Can Lucene do a "search within search", so that the second search is constrained by the results of the first query? ====
+
+Yes. There are two primary options:
+
+ * Use `QueryFilter` with the previous query as the filter. (you can search the mailing list archives for `QueryFilter` and Doug Cutting's recommendations against using it for this purpose)
+ * Combine the previous query with the current query using `BooleanQuery`, using the previous query as required.
+
+The `BooleanQuery` approach is the recommended one.
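The `BooleanQuery` option can be pictured as a set intersection: when the previous query and the new query are both required clauses, a document matches only if it matched both. The sketch below models just that idea with plain Java sets and does not call Lucene at all; the class name, the `searchWithin` method, and the document numbers are hypothetical. As far as I know, in the Lucene 1.4 API the corresponding step is adding both queries to a `BooleanQuery` with `add(query, true, false)` (required, not prohibited).

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class SearchWithinSearchSketch {

    // Toy stand-in for "previous query AND current query", both required:
    // keep only the document numbers that appear in both result sets.
    public static Set<Integer> searchWithin(Set<Integer> previousHits, Set<Integer> currentHits) {
        Set<Integer> narrowed = new TreeSet<>(previousHits);
        narrowed.retainAll(currentHits);
        return narrowed;
    }

    public static void main(String[] args) {
        // Hypothetical document numbers matched by each query.
        Set<Integer> previous = new TreeSet<>(Arrays.asList(1, 4, 7, 9));
        Set<Integer> current = new TreeSet<>(Arrays.asList(4, 5, 9, 12));

        // Only documents 4 and 9 matched both queries.
        System.out.println(searchWithin(previous, current)); // prints [4, 9]
    }
}
```

The model ignores scoring: in Lucene the combined query also re-scores the surviving documents against both clauses, which a plain intersection does not capture.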
+
+
+==== Does the position of the matches in the text affect the scoring? ====
+
+No, the position of matches within a field does not affect ranking.
+
+
+==== How do I make sure that a match in a document title has greater weight than a match in a document body? ====
+
+If you put the title in a separate field from the body, and search both fields, matches in the title will usually be stronger without explicit boosting. This is because the scores are normalized by the length of the field, and the title tends to be much shorter than the body. Therefore, even without boosting, title matches usually come before body matches.
+
+
+
+=== Indexing ===
+
+==== Can I use Lucene to crawl my site or other sites on the Internet? ====
+
+No. Lucene does not know how to access external documents, nor does it know how to extract the content and links of HTML and other document formats. Lucene focuses on indexing and searching and does both well. However, several crawlers are available which you could use: [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java]
+
+
+==== How do I perform a simple indexing of a set of documents? ====
+
+The easiest way is to re-index the entire document set periodically or whenever it changes. All you need to do is to create an instance of IndexWriter, iterate over your document set, create for each document a Lucene Document object and add it to the IndexWriter. When you are done make sure to close the IndexWriter. This will release all of its resources and will close the files it created.
+
+
+==== How can I add document(s) to the index? ====
+
+Simply create an IndexWriter and use its addDocument() method. Make sure to create the IndexWriter with the 'create' flag set to false and make sure to close the IndexWriter when you are done adding the documents.
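Why the 'create' flag matters can be shown with a toy model. This is only a sketch of the behaviour described above, not Lucene code: `ToyIndex` and `ToyWriter` are hypothetical stand-ins, and the only things they mirror are that opening a writer with create set to true starts from an empty index, that create set to false keeps what is already there, and that the writer should be closed when done.

```java
import java.util.ArrayList;
import java.util.List;

public class CreateFlagSketch {

    // Hypothetical stand-in for an index directory: just a list of document texts.
    static final class ToyIndex {
        final List<String> docs = new ArrayList<>();
    }

    // Hypothetical stand-in for a writer; it models only the 'create' flag and close().
    static final class ToyWriter implements AutoCloseable {
        private final ToyIndex index;
        private boolean open = true;

        ToyWriter(ToyIndex index, boolean create) {
            this.index = index;
            if (create) {
                index.docs.clear(); // create == true discards any existing documents
            }
        }

        void addDocument(String doc) {
            if (!open) {
                throw new IllegalStateException("writer is closed");
            }
            index.docs.add(doc);
        }

        @Override
        public void close() {
            open = false;
        }
    }

    // Builds an index with one document, then reopens it and adds a second one.
    public static int addToExisting(boolean createOnReopen) {
        ToyIndex index = new ToyIndex();
        try (ToyWriter writer = new ToyWriter(index, true)) {
            writer.addDocument("first");
        }
        try (ToyWriter writer = new ToyWriter(index, createOnReopen)) {
            writer.addDocument("second");
        }
        return index.docs.size();
    }

    public static void main(String[] args) {
        System.out.println(addToExisting(false)); // prints 2: the first document is kept
        System.out.println(addToExisting(true));  // prints 1: the first document was wiped
    }
}
```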
+
+
+==== Where does Lucene store the index it builds? ====
+
+Typically, the index is stored in a set of files that Lucene creates in a directory of your choice. If your system uses multiple independent indices, simply create a separate directory for each index.
+
+Lucene's API also provides a way to use or implement other storage methods such as a non-persistent in-memory storage, or a mapping of Lucene data to any third party database.
+
+
+==== Does Lucene store a full copy of the indexed documents? ====
+
+It is up to you. You can tell Lucene what document information to use just for indexing and what document information to also store in the index (with or without indexing).
+
+
+==== What happens when you IndexWriter.addDocument() a document that is already in the index? Does it overwrite the previous document? ====
+
+No, there will be multiple copies of the same document in the index.
+
+
+==== How do I delete documents from the index? ====
+
+If you know the document number of a document that you want to delete you may use:
+
+`IndexReader.delete(docNum)`
+
+That will delete the document numbered `docNum` from the index. Once a document is deleted it will not appear in `TermDocs` nor `TermPositions` enumerations.
+
+Attempts to read its field with the `document` method will result in an error. The presence of this document may still be reflected in the `docFreq` statistic, though this will be corrected eventually as the index is further modified.
+
+If you want to delete all (1 or more) documents that contain a specific term you may use:
+
+`IndexReader.delete(Term)`
+
+This is useful if one uses a document field to hold a unique ID string for
+the document. Then to delete such a document, one merely constructs a
+term with the appropriate field and the unique ID string as its text and
+passes it to this method. Because a variable number of documents can be affected by this method call, this method returns the number of documents deleted.
+
+
+==== Is there a way to limit the size of an index? ====
+
+This question is sometimes brought up because of the 2GB file size limit of some 32-bit operating systems.
+
+This is a slightly modified answer from Doug Cutting:
+
+The easiest thing is to set `IndexWriter.maxMergeDocs`.
+
+If, for instance, you hit the 2GB limit at 8M documents set `maxMergeDocs` to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. It will actually effectively round this down to the next lower power of `IndexWriter.mergeFactor`.
+
+So with the default `mergeFactor` set to 10 and `maxMergeDocs` set to 7M Lucene will generate a series of 1M document indexes, since merging 10 of these would exceed the maximum.
+
+A slightly more complex solution:
+
+You could further minimize the number of segments if, when you've added 7M documents, you optimize the index and start a new index. Then use `MultiSearcher` to search the indexes.
+
+An even more complex and optimal solution:
+
+Write a version of `FSDirectory` that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files.
+
+
+==== Why is it important to use the same analyzer type during indexing and search? ====
+
+The analyzer controls how the text is broken into terms which are then used to index the document. If you are using an analyzer of one type to index and an analyzer of a different type to parse the search query, it is possible that the same word will be mapped to two different terms and this will result in missing or false hits.
+
+
+==== What is index optimization and when should I use it? ====
+
+The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental update adds documents frequently, you want to perform the optimization only once in a while to avoid the extra overhead of the optimization.
+
+==== What are Segments? ====
+
+The index database is composed of 'segments' each stored in a separate file. When you add documents to the index, new segments may be created. You can compact the database and reduce the number of segments by optimizing it (see a separate question regarding index optimization).
+
+
+==== Is Lucene index database platform independent? ====
+
+Yes, you can copy a Lucene index directory from one platform to another and it will work just as well.
+
+
+==== When I recreate an index from scratch, do I have to delete the old index files? ====
+
+No, creating the index writer with "true" should remove all old files in the old index.
+
+
+==== How can I index and search digits and other non-alphabetic characters? ====
+
+The components responsible for this are the various `Analyzers`.
+
+The demos included in the Lucene distribution use `StopAnalyzer`, which filters out non-alphabetic characters. To include non-alphabetic characters, such as digits and various punctuation characters, in your index use `org.apache.lucene.analysis.standard.StandardAnalyzer` instead of `StopAnalyzer`.
+
+
+==== Is the IndexWriter class, and especially the method addIndexes(Directory[]) thread safe? ====
+
+Yes, the `IndexWriter.addIndexes(Directory[])` method is thread safe. It is a `final synchronized` method.
+
+
+==== Do document IDs change after merging indices or after document deletion? ====
+
+Yes, document IDs do change.
+
+
+==== What is the purpose of write.lock file, when is it used, and by which classes? ====
+
+The write.lock is used to keep processes from concurrently attempting
+to modify an index.
+
+It is obtained by an `IndexWriter` while it is open, and by an `IndexReader` once documents have been deleted and until it is closed.
+
+
+==== What is the purpose of the commit.lock file, when is it used, and by which classes? ====
+
+The commit.lock file is used to coordinate the contents of the 'segments'
+file with the files in the index. It is obtained by an `IndexReader` before it reads the 'segments' file, which names all of the other files in the
+index, and until the `IndexReader` has opened all of these other files.
+
+The commit.lock is also obtained by the `IndexWriter` when it is about to write the segments file and until it has finished trying to delete obsolete index files.
+
+The commit.lock should thus never be held for long, since while
+it is obtained files are only opened or deleted, and one small file is
+read or written.
+
+
+==== Is there a maximum number of segment infos whose summary (name and document count) is stored in the segments file? ====
+
+All segments in the index are listed in the segments file. There is no hard limit. For an un-optimized index it is proportional to the log of the number of documents in the index. An optimized index contains a single segment.
+
+
+==== What happens when I open an IndexWriter, optimize the index, and then close the IndexWriter? Which files will be added or modified? ====
+
+All of the segments are merged into a single new segment file.
+If the index was empty to begin with, no segments will be created, only the `segments` file.
+
+
+==== If I decide not to optimize the index, when will the deleted documents actually get deleted? ====
+
+Documents that are deleted really are deleted (???). However, the space they consume in the index does not get reclaimed until the index is optimized. That space will also eventually be reclaimed as more documents are added to the index, even if the index does not get optimized.
+
+
+==== How do I update a document or a set of documents that are already indexed? ====
+
+There is no direct update procedure in Lucene. To update an index incrementally you must first '''delete''' the documents that were updated, and '''then re-add''' them to the index.
+
+
+==== How do I write my own Analyzer? ====
+
+Here is an example:
+
+{{{
+import java.io.Reader;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.CharTokenizer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+
+public class MyAnalyzer extends Analyzer
+{
+    private static final Analyzer STANDARD = new StandardAnalyzer();
+
+    public TokenStream tokenStream(String field, final Reader reader)
+    {
+        // do not tokenize field called 'element'
+        if ("element".equals(field)) {
+            // every character counts as a token character, so the input is never split
+            return new CharTokenizer(reader) {
+                protected boolean isTokenChar(char c) {
+                    return true;
+                }
+            };
+        } else {
+            // use standard analyzer
+            return STANDARD.tokenStream(field, reader);
+        }
+    }
+}
+}}}
+
+
+==== How do I index non Latin characters? ====
+
+The solution is to ensure that the query string is encoded the same way that strings in the index are. For instance, something along the lines of this will work if your index is also using UTF-8 encoding.
+
+{{{
+String queryStr = new String("query string here".getBytes("UTF-8"));
+}}}
+
+
+==== How can I index HTML documents? ====
+
+In order to index HTML documents you need to first parse them to extract text that you want to index from them. Here are some HTML parsers that can help you with that:
+
+An example that uses JavaCC to parse HTML into Lucene Document objects is provided in the [http://jakarta.apache.org/lucene/docs/demo3.html Lucene web application demo] that comes with the Lucene distribution.
+
+The [http://www.apache.org/~andyc/neko/doc/html/ CyberNeko HTML Parser] lets you parse HTML documents. It's relatively easy to remove most of the tags from an HTML document (or all if you want), and then use the ones you left in to help create metadata for your Lucene document. NekoHTML also provides a DOM model for navigating through the HTML.
+
+[http://jtidy.sourceforge.net/ JTidy] cleans up HTML, and can provide a DOM interface to the HTML files through a Java API.
+
+
+==== How can I index XML documents? ====
+
+In order to index XML documents you need to first parse them to extract text that you want to index from them. Here are some XML parsers that can help you with that:
+
+See the [http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/XML-Indexing-Demo/ XML Demo]. This contribution is some sample code that demonstrates adding simple XML documents into the index. It creates a new Document object for each file, and then populates the Document with a Field for each XML element, recursively. There are examples included for both SAX and DOM.
+
+See the article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing, and searching XML with Digester and Lucene].
+
+
+==== How can I index MS-Word documents? ====
+
+In order to index Word documents you need to first parse them to extract text that you want to index from them. Here are some Word parsers that can help you with that:
+
+[http://jakarta.apache.org/poi/ Jakarta Apache POI] has an early development level Microsoft Word parser for versions of Word from Office 97, 2000, and XP.
+
+[http://www.textmining.org/ Simple Text Extractor Library] relies on POI.
+
+
+==== How can I index MS-Excel documents? ====
+
+In order to index Excel documents you need to first parse them to extract text that you want to index from them. Here are some Excel parsers that can help you with that:
+
+[http://jakarta.apache.org/poi/ Jakarta Apache POI] has an excellent Microsoft Excel parser for versions of Excel from Office 97, 2000, and XP. You can also modify Excel files with this tool.
+
+
+==== How can I index MS-Powerpoint documents? ====
+
+In order to index Powerpoint documents you need to first parse them to extract text that you want to index from them. You can use the [http://jakarta.apache.org/poi/ Jakarta Apache POI], as it contains a parser for Powerpoint documents.
+
+
+==== How can I index RTF documents? ====
+
+In order to index RTF documents you need to first parse them to extract text that you want to index from them. Here are some RTF parsers that can help you with that:
+
+[http://www.tetrasix.com/ MajiX] is a translation utility that will turn RTF (Rich Text Format) files into XML files. These XML files could be indexed like any other XML file, or you could write some custom code. (??? doesn't seem to exist anymore -- mention Java's Swing widget instead that can be used to access RTF)
+
+
+==== How can I index PDF documents? ====
+
+In order to index PDF documents you need to first parse them to extract text that you want to index from them. Here are some PDF parsers that can help you with that:
+
+[http://pdfbox.org/ PDFBox] is a Java API from Ben Litchfield that will let you access the contents of a PDF document. It comes with integration classes for Lucene to translate a PDF into a Lucene document.
+
+[http://www.foolabs.com/xpdf/ XPDF] is an open source tool that is licensed under the GPL. It's not a Java tool, but there is a utility called pdftotext that can translate PDF files into text files on most platforms from the command line.
+
+Based on xpdf, there is a utility called [http://pdftohtml.sourceforge.net/ pdftohtml] that can translate PDF files into HTML files. This is also not a Java application.
+
+[http://www.jpedal.org/ JPedal] is a Java API for extracting text and images from PDF documents.
+
+
+==== How can I index JSP files? ====
+
+To index the content of JSPs that a user would see using a Web browser, you would need to write an application that acts as a Web client, in order to mimic the Web browser behaviour (i.e. a web crawler). Once you have such an application, you should be able to point it to the desired JSP, retrieve the contents that the JSP generates, parse it, and feed it to Lucene. See [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java].
+
+How to parse the output of the JSP depends on the type of content that the JSP generates. In most cases the content is going to be in HTML format.
+
+Most importantly, do not try to index JSPs by treating them as normal files in your file system. In order to index JSPs properly you need to access them via HTTP, acting like a Web client.
+
+
+==== If I use a compound file-style index, do I still need to optimize my index? ====
+
+Yes. Each .cfs file created in the compound file-style index represents a single segment, which means you can still merge multiple segments into a single segment by optimizing the index.
+
+
+==== What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments? ====
+
+When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.
+
+The primary advantage of the IndexReader-based method is that one can pass it IndexReaders that don't reside in a Directory.
+
+
+==== Can I use Lucene to index text in Chinese, Japanese, Korean, and other multi-byte character sets? ====
+
+Yes, you can. Lucene is not limited to English, nor any other language. To index text properly, you need to use an Analyzer appropriate for the language of the text you are indexing. Lucene's default Analyzers work well for English. There are a number of other Analyzers in [http://jakarta.apache.org/lucene/docs/lucene-sandbox/ Lucene Sandbox], including those for Chinese, Japanese, and Korean.
 
- ||Lucene FAQ at JGuru||[http://www.jguru.com/faq/Lucene]||
- ||Original Lucene FAQs, no longer maintained||[http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi]||

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org