Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Content-Type: text/plain; charset="iso-8859-1"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
From: lucene-cvs@jakarta.apache.org
To: lucene-cvs@jakarta.apache.org
Subject: =?iso-8859-1?q?=5BJakarta_Lucene_Wiki=5D_Updated=3A__LuceneFAQ?=
Date: Wed, 22 Dec 2004 23:19:31 -0000
Message-ID: <20041222231931.62512.72952@minotaur.apache.org>

   Date: 2004-12-22T15:19:31
   Editor: DanielNaber
   Wiki: Jakarta Lucene Wiki
   Page: LuceneFAQ
   URL: http://wiki.apache.org/jakarta-lucene/LuceneFAQ

   new FAQ -- still a bit work in progress

Change Log:

---------------------------------------------------------------------------=
---
@@ -1,6 +1,495 @@
-=3D FAQs =3D
+This FAQ is currently being worked on (2004-12-22); the update should =
be done in a few days.
 =

- There are two official FAQs for Lucene:
+Note that the [http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi ol=
d FAQ] isn't maintained anymore.
+
+[[TableOfContents]]
+
+=3D=3D FAQ =3D=3D
+
+=3D=3D=3D General =3D=3D=3D
+
+=3D=3D=3D=3D What is the URL of Lucene's home page? =3D=3D=3D=3D
+
+Lucene's home is at The Jakarta Project: http://jakarta.apache.org/lucene/.
+
+
+=3D=3D=3D=3D Are there any mailing lists available? =3D=3D=3D=3D
+
+There's a user list and a developer list, both available at http://jakarta=
.apache.org/site/mail2.html#Lucene
+
+
+=3D=3D=3D=3D What Java version is required to run Lucene? =3D=3D=3D=3D
+
+Lucene will run with JDK 1.1.8 and up. (???)
+
+
+=3D=3D=3D=3D Will Lucene work with my Java application? =3D=3D=3D=3D
+
+Yes, Lucene has no external dependencies.
+
+
+=3D=3D=3D=3D Where can I get the javadocs for the org.apache.lucene classe=
s? =3D=3D=3D=3D
+
+The docs for all the classes are available online at http://jakarta.apache=
.org/lucene/docs/api/. In addition, they are a part of the standard distrib=
ution, and you can always recreate them by running `ant javadocs`.
+
+
+=3D=3D=3D=3D Why can't I use Lucene with IBM JDK 1.3.1? =3D=3D=3D=3D
+
+Apparently there is a bug in IBM's JIT code in JDK 1.3.1.
+To work around it, disable JIT for the `org.apache.lucene.store.OutputStre=
am.writeInt` method by setting the following environment variable:
+
+`JITC_COMPILEOPT=3DSKIP{org/apache/lucene/store/OutputStream}{writeInt}`
+
+
+=3D=3D=3D=3D Where does the name Lucene come from? =3D=3D=3D=3D
+
+Lucene is Doug Cutting's wife's middle name, and her maternal grandmother'=
s first name.
+
+
+=3D=3D=3D=3D Are there any alternatives to Lucene? =3D=3D=3D=3D
+
+Besides commercial products, which we don't know much about, there's =
also [http://www.egothor.org Egothor].
+
+
+=3D=3D=3D=3D Does Lucene have a web crawler? =3D=3D=3D=3D
+
+No, check out the [http://java-source.net/open-source/crawlers list of Ope=
n Source Crawlers in Java].
+
+ =

+
+=3D=3D=3D Searching =3D=3D=3D
+
+=3D=3D=3D=3D What wildcard search support is available from Lucene? =3D=3D=
=3D=3D
+
+Lucene supports wildcard queries, which allow you to perform searches =
such as ''book*''. That query will find documents containing terms such =
as ''book'', ''bookstore'', ''booklet'', etc. Lucene refers to this type =
of query as a 'prefix query'.
+
+Lucene also supports wildcard queries with the wildcard in the middle of =
the query term. For instance, you could search for ''mi*pelling''. That =
will match both ''misspelling'', which is the correct way to spell this =
word, as well as ''mispelling'', which is a common spelling mistake.
+
+Another wildcard character that you can use is '?', a question mark. The =
? will match a single character. This allows you to perform queries such =
as ''Bra?il''. Such a query will match both ''Brasil'' and ''Brazil''. =
Lucene refers to this type of query as a 'wildcard query'.
+
+'''Note''': Leading wildcards (e.g. ''*ook'') are '''not''' supported by t=
he QueryParser.
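+
+As a rough illustration of what the two wildcard characters mean, the
+sketch below translates a pattern into a `java.util.regex` expression.
+This is not how Lucene evaluates wildcard queries (it matches against
+the terms in the index); the class and method names are made up for
+this example.
+
```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: models '*' and '?' semantics.
final class WildcardSketch {
    // Translate a wildcard pattern into a regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");      // zero or more characters
            } else if (c == '?') {
                regex.append('.');       // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("mi*pelling", "misspelling")); // true
        System.out.println(matches("Bra?il", "Brazil"));          // true
        System.out.println(matches("book*", "notebook"));         // false
    }
}
```
+
+With these rules ''book*'' matches ''bookstore'' but not ''notebook'',
+because the pattern has to match the term from its first character.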
+
+
+=3D=3D=3D=3D Is the QueryParser thread-safe? =3D=3D=3D=3D
+
+Yes, `QueryParser` is thread-safe.  Its static `parse` method creates a ne=
w instance of `QueryParser` each time it is called. (??? so is it thread sa=
fe only for the static method?)
+
+
+=3D=3D=3D=3D How do I restrict searches to only return results from a limi=
ted subset of documents in the index (e.g. for privacy reasons)? What is th=
e best way to approach this? =3D=3D=3D=3D
+
+The [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/=
QueryFilter.html QueryFilter] class is designed precisely for such cases.
+
+Another way of doing it is the following:
+
+Just before calling `IndexSearcher.search()` add a clause to the query to =
exclude documents in categories not permitted for this search.
+
+If you are restricting access with a prohibited term, and someone tries to=
 require that term, then the prohibited restriction wins. If you are restri=
cting access with a required term, and they try prohibiting that term, then=
 they will get no documents in their search result.
+
+As for deciding whether to use required or prohibited terms, if possible,
+you should choose the method that names the less frequent term.  That will
+make queries faster.
+
+
+=3D=3D=3D=3D What is the order of fields returned by Document.fields()? =
=3D=3D=3D=3D
+
+Fields are returned in the same order they were added to the document.
+
+
+=3D=3D=3D=3D How does one determine which documents do not have a certain =
term? =3D=3D=3D=3D
+
+There is no direct way of doing that.  You could add a term "x" to every d=
ocument, and then search for "+x -y" to find all of the documents that don'=
t have "y". Note that for large collections this would be slow because of t=
he high term frequency for term "x".
+
+
+=3D=3D=3D=3D How do I get the last document added that has a particular te=
rm? =3D=3D=3D=3D
+
+Call:
+
+`TermDocs td =3D IndexReader.termDocs(Term);`
+
+Then iterate to the last document in the `TermDocs` that this method =
returns.
+
+
+=3D=3D=3D=3D Does MultiSearcher do anything particularly efficient to sear=
ch multiple indices or does it simply search one after the other? =3D=3D=3D=
=3D
+
+`MultiSearcher` searches indices sequentially. Use =
`ParallelMultiSearcher` to search multiple indices in parallel.
+
+
+=3D=3D=3D=3D Is there a way to use a proximity operator (like near or with=
in) with Lucene? =3D=3D=3D=3D
+
+There is a variable called `slop` in `PhraseQuery` that allows you to perf=
orm NEAR/WITHIN-like queries.
+
+By default, `slop` is set to 0 so that only exact phrases will match.
+However, you can alter the value using the `setSlop(int)` method.
+
+When using QueryParser you can use this syntax to specify the slop: "doug =
cutting"~2 will find documents that contain "doug cutting" as well as ones =
that contain "cutting doug".
+
+
+=3D=3D=3D=3D Are Wildcard, Prefix, and Fuzzy queries case sensitive? =3D=
=3D=3D=3D
+
+No, but unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy=
 queries are not passed through the `Analyzer`, which is the component that=
 performs operations such as stemming.
+
+The reason for skipping the `Analyzer` is that if you were searching for '=
'"dogs*"'' you would not want ''"dogs"'' first stemmed to ''"dog"'', since =
that would then match ''"dog*"'', which is not the
+intended query.
+
+
+=3D=3D=3D=3D Why does IndexReader's maxDoc() return an 'incorrect' number =
of documents sometimes? =3D=3D=3D=3D
+
+According to the Javadoc, the `IndexReader.maxDoc()` method ''"returns =
one greater than the largest possible document number".''
+
+In other words, the number returned by `maxDoc()` does not necessarily mat=
ch the actual number of undeleted documents in the index.
+
+Deleted documents do not get removed from the index immediately, unless yo=
u call `optimize()`.
+
+
+=3D=3D=3D=3D Is there a way to get a text summary of an indexed document w=
ith Lucene? =3D=3D=3D=3D
+
+You could store the document's summary in the index and then use the =
Highlighter from the sandbox.
+
+
+=3D=3D=3D=3D Can I search an index while it is being optimized? =3D=3D=3D=
=3D
+
+Yes, an index can be searched and optimized simultaneously.
+
+
+=3D=3D=3D=3D Can I cache search results with Lucene? =3D=3D=3D=3D
+
+Lucene does come with a simple cache mechanism, if you use [http://jakarta=
.apache.org/lucene/docs/api/org/apache/lucene/search/Filter.html Lucene =
Filters].
+The classes to look at are [http://jakarta.apache.org/lucene/docs/api/org/=
apache/lucene/search/CachingWrapperFilter.html CachingWrapperFilter] and [h=
ttp://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/QueryFilt=
er.html QueryFilter].
+
+
+=3D=3D=3D=3D Is the IndexSearcher thread-safe? =3D=3D=3D=3D
+
+'''Yes''', IndexSearcher is thread-safe.  Multiple search threads may acce=
ss the index concurrently without any problems.
+
+
+=3D=3D=3D=3D Is there a way to retrieve the original term positions during=
 the search? =3D=3D=3D=3D
+
+Yes, see the Javadoc for `IndexReader.termPositions()`.
+
+
+=3D=3D=3D=3D How do I retrieve all the values of a particular field that e=
xists within an index, across all documents? =3D=3D=3D=3D
+
+The trick is to enumerate terms with that field.  Terms are sorted first =

+by field, then by text, so all terms with a given field are adjacent in =

+enumerations.  Term enumeration is also efficient.
+
+{{{
+TermEnum terms =3D indexReader.terms(new Term("FIELD-NAME-HERE", ""));
+try
+{
+    // terms.term() is null once the enumeration is exhausted
+    while (terms.term() !=3D null
+            && "FIELD-NAME-HERE".equals(terms.term().field()))
+    {
+        // ... collect terms.term().text() ...
+
+        if (!terms.next())
+            break;
+    }
+}
+finally
+{
+    terms.close();
+}
+}}}
+
+
+=3D=3D=3D=3D Can Lucene do a "search within search", so that the second se=
arch is constrained by the results of the first query? =3D=3D=3D=3D
+
+Yes.  There are two primary options:
+
+ * Use `QueryFilter` with the previous query as the filter. (you can searc=
h the mailing list archives for `QueryFilter` and Doug Cutting's recommenda=
tions against using it for this purpose)
+ * Combine the previous query with the current query using `BooleanQuery`,=
 using the previous query as required.
+
+The `BooleanQuery` approach is the recommended one.
+
+
+=3D=3D=3D=3D Does the position of the matches in the text affect the scori=
ng? =3D=3D=3D=3D
+
+No, the position of matches within a field does not affect ranking.
+
+
+=3D=3D=3D=3D How do I make sure that a match in a document title has great=
er weight than a match in a document body? =3D=3D=3D=3D
+
+If you put the title in a separate field from the body, and search both fi=
elds, matches in the title will usually be stronger without explicit boosti=
ng. This is because the scores are normalized by the length of the field, a=
nd the title tends to be much shorter than the body.  Therefore, even witho=
ut boosting, title matches usually come before body matches.
+
+
+
+=3D=3D=3D Indexing =3D=3D=3D
+
+=3D=3D=3D=3D Can I use Lucene to crawl my site or other sites on the Inter=
net? =3D=3D=3D=3D
+
+No. Lucene does not know how to access external documents, nor does it =
know how to extract the content and links of HTML and other document =
formats. Lucene focuses on indexing and searching, and does it well. =
However, several crawlers are available which you could use: see the =
[http://java-source.net/open-source/crawlers list of Open Source =
Crawlers in Java].
+
+
+=3D=3D=3D=3D How do I perform a simple indexing of a set of documents? =3D=
=3D=3D=3D
+
+The easiest way is to re-index the entire document set periodically or =
whenever it changes. All you need to do is create an `IndexWriter`, =
iterate over your document set, create a Lucene `Document` object for =
each document, and add it to the `IndexWriter`. When you are done, make =
sure to close the `IndexWriter`. This will release all of its resources =
and close the files it created.
+
+
+=3D=3D=3D=3D How can I add document(s) to the index? =3D=3D=3D=3D
+
+Simply create an IndexWriter and use its addDocument() method. Make sure t=
o create the IndexWriter with the 'create' flag set to false and make sure =
to close the IndexWriter when you are done adding the documents.
+
+
+=3D=3D=3D=3D Where does Lucene store the index it builds? =3D=3D=3D=3D
+
+Typically, the index is stored in a set of files that Lucene creates in =
a directory of your choice. If your system uses multiple independent =
indices, simply create a separate directory for each index.
+
+Lucene's API also provides a way to use or implement other storage =
methods, such as non-persistent in-memory storage or a mapping of Lucene =
data to any third-party database.
+
+
+=3D=3D=3D=3D Does Lucene store a full copy of the indexed documents? =3D=
=3D=3D=3D
+
+It is up to you. You can tell Lucene what document information to use just=
 for indexing and what document information to also store in the index (wit=
h or without indexing).
+
+
+=3D=3D=3D=3D What happens when you call IndexWriter.addDocument() with a =
document that is already in the index? Does it overwrite the previous =
document? =3D=3D=3D=3D
+
+No, there will be multiple copies of the same document in the index.
+
+
+=3D=3D=3D=3D How do I delete documents from the index? =3D=3D=3D=3D
+
+If you know the document number of a document that you want to delete you =
may use:
+
+`IndexReader.delete(docNum)`
+
+That will delete the document numbered `docNum` from the index.  Once a do=
cument is deleted it will not appear in `TermDocs` nor `TermPositions` enum=
erations.
+
+Attempts to read its field with the `document` method will result in an er=
ror.  The presence of this document may still be reflected in the `docFreq`=
 statistic, though this will be corrected eventually as the index is furthe=
r modified.
+
+If you want to delete all (1 or more) documents that contain a specific te=
rm you may use:
+
+`IndexReader.delete(Term)`
+
+This is useful if one uses a document field to hold a unique ID string for
+the document.  Then to delete such a document, one merely constructs a
+term with the appropriate field and the unique ID string as its text and
+passes it to this method. Because a variable number of documents can be =
affected by this call, the method returns the number of documents deleted.
+
+
+=3D=3D=3D=3D Is there a way to limit the size of an index? =3D=3D=3D=3D
+
+This question is sometimes brought up because of the 2GB file size limit o=
f some 32-bit operating systems.
+
+This is a slightly modified answer from Doug Cutting:
+
+The easiest thing is to set `IndexWriter.maxMergeDocs`.
+
+If, for instance, you hit the 2GB limit at 8M documents, set =
`maxMergeDocs` to 7M.  That will keep Lucene from trying to merge an =
index that won't fit in your filesystem.  It will effectively round this =
down to the next lower power of `IndexWriter.mergeFactor`.
+
+So with the default `mergeFactor` set to 10 and `maxMergeDocs` set to 7M L=
ucene will generate a series of 1M document indexes, since merging 10 of th=
ese would exceed the maximum.
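+
+The rounding can be checked with a little arithmetic. The sketch below
+only models the description above; it is not Lucene's merge code, and
+the class and method names are invented for the example.
+
```java
// Hypothetical helper, not part of Lucene: models the rounding described
// above, where segment sizes grow as powers of mergeFactor.
final class MergeSizeSketch {
    // Largest power of mergeFactor that does not exceed maxMergeDocs.
    static long largestSegmentDocs(long maxMergeDocs, int mergeFactor) {
        if (maxMergeDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException(
                    "need maxMergeDocs >= 1 and mergeFactor >= 2");
        }
        long size = 1;
        // Stop before the next multiplication would exceed the limit.
        while (size <= maxMergeDocs / mergeFactor) {
            size *= mergeFactor;
        }
        return size;
    }

    public static void main(String[] args) {
        // 7M documents with mergeFactor 10: prints 1000000
        System.out.println(largestSegmentDocs(7000000L, 10));
    }
}
```
+
+Under this model a limit of 7M and a limit of 9.9M give the same 1M
+segments; only reaching 10M moves to the next power.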
+
+A slightly more complex solution:
+
+You could further minimize the number of segments if, once you've added =
7M documents, you optimize the index and start a new one.  Then use =
`MultiSearcher` to search the indexes.
+
+An even more complex and optimal solution:
+
+Write a version of `FSDirectory` that, when a file exceeds 2GB, creates a =
subdirectory and represents the file as a series of files.
+
+
+=3D=3D=3D=3D Why is it important to use the same analyzer type during inde=
xing and search? =3D=3D=3D=3D
+
+The analyzer controls how the text is broken into terms, which are then =
used to index the document. If you use an analyzer of one type to index =
and an analyzer of a different type to parse the search query, the same =
word may be mapped to two different terms, which results in missing or =
false hits.
+
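+A toy model of such a mismatch, with no Lucene classes involved: the
+"index" below is just a set of terms, and the two hypothetical
+analyzers differ only in whether they lower-case the text.
+
```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model, not Lucene code: shows why index-time and query-time
// analysis have to agree.
final class AnalyzerMismatchSketch {
    // "Analyzer" A: split on whitespace and lower-case each term.
    static Set<String> lowerCaseTerms(String text) {
        Set<String> terms = new HashSet<String>();
        for (String t : text.trim().split("\\s+")) {
            terms.add(t.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    // "Analyzer" B: split on whitespace only, keeping the original case.
    static Set<String> rawTerms(String text) {
        Set<String> terms = new HashSet<String>();
        for (String t : text.trim().split("\\s+")) {
            terms.add(t);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> index = lowerCaseTerms("Lucene Search Engine");
        // Same analyzer on both sides: the query term is found (true).
        System.out.println(index.containsAll(lowerCaseTerms("Lucene")));
        // Different analyzer at query time: "Lucene" is not "lucene" (false).
        System.out.println(index.containsAll(rawTerms("Lucene")));
    }
}
```
+
+The second lookup misses even though the word was indexed, which is
+the kind of missing hit described above.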
+
+=3D=3D=3D=3D What is index optimization and when should I use it? =3D=3D=
=3D=3D
+
+The `IndexWriter` class supports an `optimize()` method that compacts =
the index database and speeds up queries. You may want to use this =
method after performing a complete indexing of your document set or =
after incremental updates of the index. If your incremental updates add =
documents frequently, you should optimize only once in a while to avoid =
the extra overhead of the optimization.
+
+=3D=3D=3D=3D What are Segments? =3D=3D=3D=3D
+
+The index database is composed of 'segments' each stored in a separate fil=
e. When you add documents to the index, new segments may be created. You ca=
n compact the database and reduce the number of segments by optimizing it (=
see a separate question regarding index optimization). =

+ =

+
+=3D=3D=3D=3D Is the Lucene index database platform independent? =3D=3D=
=3D=3D
+
+Yes, you can copy a Lucene index directory from one platform to another an=
d it will work just as well.
+ =

+
+=3D=3D=3D=3D When I recreate an index from scratch, do I have to delete th=
e old index files? =3D=3D=3D=3D
+
+No, creating the `IndexWriter` with the 'create' flag set to true =
should remove all old files in the old index.
+
+ =

+=3D=3D=3D=3D How can I index and search digits and other non-alphabetic ch=
aracters? =3D=3D=3D=3D
+
+The components responsible for this are the various `Analyzer` classes.
+
+The demos included in the Lucene distribution use `StopAnalyzer`, which =
filters out non-alphabetic characters. To include non-alphabetic =
characters, such as digits and various punctuation characters, in your =
index, use `org.apache.lucene.analysis.standard.StandardAnalyzer` =
instead of `StopAnalyzer`.
+
+
+=3D=3D=3D=3D Is the IndexWriter class, and especially the method addIndexe=
s(Directory[]) thread safe? =3D=3D=3D=3D
+
+Yes, the `IndexWriter.addIndexes(Directory[])` method is thread safe. =
It is a `final synchronized` method.
+
+
+=3D=3D=3D=3D Do document IDs change after merging indices or after documen=
t deletion? =3D=3D=3D=3D
+
+Yes, document IDs do change.
+
+
+=3D=3D=3D=3D What is the purpose of the write.lock file, when is it =
used, and by which classes? =3D=3D=3D=3D
+
+The write.lock is used to keep processes from concurrently attempting
+to modify an index. =

+
+It is obtained by an `IndexWriter` while it is open, and by an `IndexReade=
r` once documents have been deleted and until it is closed.
+
+
+=3D=3D=3D=3D What is the purpose of the commit.lock file, when is it used,=
 and by which classes? =3D=3D=3D=3D
+
+The commit.lock file is used to coordinate the contents of the 'segments'
+file with the files in the index.  It is obtained by an `IndexReader` befo=
re it reads the 'segments' file, which names all of the other files in the
+index, and until the `IndexReader` has opened all of these other files.
+
+The commit.lock is also obtained by the `IndexWriter` when it is about to =
write the segments file and until it has finished trying to delete obsolete=
 index files.
+
+The commit.lock should thus never be held for long, since while
+it is obtained files are only opened or deleted, and one small file is
+read or written.
+
+
+=3D=3D=3D=3D Is there a maximum number of segment infos whose summary (nam=
e and document count) is stored in the segments file? =3D=3D=3D=3D
+
+All segments in the index are listed in the segments file.  There is no ha=
rd limit. For an un-optimized index it is proportional to the log of the nu=
mber of documents in the index. An optimized index contains a single segmen=
t.
+
+
+=3D=3D=3D=3D What happens when I open an IndexWriter, optimize the index, =
and then close the IndexWriter?  Which files will be added or modified? =3D=
=3D=3D=3D
+
+All of the segments are merged into a single new segment file.
+If the index was empty to begin with, no segments will be created, only th=
e `segments` file.
+
+
+=3D=3D=3D=3D If I decide not to optimize the index, when will the deleted =
documents actually get deleted? =3D=3D=3D=3D
+
+Documents that are deleted really are deleted (???).  However, the space =
they consume in the index does not get reclaimed until the index is =
optimized.  That space will also eventually be reclaimed as more =
documents are added to the index, even if the index does not get =
optimized.
+
+
+=3D=3D=3D=3D How do I update a document or a set of documents that are alr=
eady indexed? =3D=3D=3D=3D
+
+There is no direct update procedure in Lucene. To update an index incremen=
tally you must first '''delete''' the documents that were updated, and '''t=
hen re-add''' them to the index.
+
+
+=3D=3D=3D=3D How do I write my own Analyzer? =3D=3D=3D=3D
+
+Here is an example:
+
+{{{
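+import java.io.Reader;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.CharTokenizer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+
+public class MyAnalyzer extends Analyzer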
+{
+    private static final Analyzer STANDARD =3D new StandardAnalyzer();
+
+    public TokenStream tokenStream(String field, final Reader reader) =

+    {
+        // do not tokenize field called 'element'
+        if ("element".equals(field)) {
+            return new CharTokenizer(reader) {
+                protected boolean isTokenChar(char c) {
+                    return true;
+                }
+            };
+        } else {
+            // use standard analyzer
+            return STANDARD.tokenStream(field, reader);
+        }
+    }
+}
+}}}
+
+
+=3D=3D=3D=3D How do I index non Latin characters? =3D=3D=3D=3D
+
+The solution is to ensure that the query string is encoded the same way th=
at strings in the index are. For instance, something along the lines of thi=
s will work if your index is also using UTF-8 encoding.
+
+{{{
+String queryStr =3D new String("query string here".getBytes("UTF-8"));
+}}}
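+
+When the query arrives as raw bytes (for example from a request body),
+naming the charset at the point of decoding avoids depending on the
+platform default. A small sketch of that idea; the class and method
+names are hypothetical and nothing here is Lucene API.
+
```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
final class QueryDecodingSketch {
    // Decode raw bytes using an explicitly named charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Munchen" with a u-umlaut as its second letter.
        String original = "M\u00fcnchen";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips
        // (prints true).
        System.out.println(
                original.equals(decode(utf8, StandardCharsets.UTF_8)));
        // Decoding with a different charset garbles the non-ASCII letter
        // (prints false).
        System.out.println(
                original.equals(decode(utf8, StandardCharsets.ISO_8859_1)));
    }
}
```
+
+The same explicit choice applies when reading the text that gets
+indexed, so that both sides produce identical strings.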
+
+
+=3D=3D=3D=3D How can I index HTML documents? =3D=3D=3D=3D
+
+In order to index HTML documents you need to first parse them to extract t=
ext that you want to index from them.  Here are some HTML parsers that can =
help you with that:
+
+An example that uses JavaCC to parse HTML into Lucene Document  objects is=
 provided in the [http://jakarta.apache.org/lucene/docs/demo3.html Lucene w=
eb application demo] that comes with the Lucene distribution.
+
+The [http://www.apache.org/~andyc/neko/doc/html/ CyberNeko HTML Parser] le=
ts you parse HTML documents. It's relatively easy to remove most of the tag=
s from an HTML document (or all if you want), and then use the ones you lef=
t in to help create metadata for your Lucene document. NekoHTML also provid=
es a DOM model for navigating through the HTML.
+
+[http://jtidy.sourceforge.net/ JTidy] cleans up HTML, and can provide a DO=
M interface to the HTML files through a Java API.
+
+
+=3D=3D=3D=3D How can I index XML documents? =3D=3D=3D=3D
+
+In order to index XML documents you need to first parse them to extract te=
xt that you want to index from them.  Here are some XML parsers that can he=
lp you with that:
+
+See the [http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contribu=
tions/XML-Indexing-Demo/ XML Demo].  This contribution is some sample code =
that demonstrates adding simple XML documents into the index.  It creates a=
 new Document object for each file, and then populates the Document with a =
Field for each XML element, recursively. There are examples included for bo=
th SAX and DOM. =

+
+See article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsi=
ng, indexing, and searching XML with Digester and Lucene].
+
+
+=3D=3D=3D=3D How can I index MS-Word documents? =3D=3D=3D=3D
+
+In order to index Word documents you need to first parse them to extract t=
ext that you want to index from them.  Here are some Word parsers that can =
help you with that:
+
+[http://jakarta.apache.org/poi/ Jakarta Apache POI] has an early developme=
nt level Microsoft Word parser for versions of Word from Office 97, 2000, a=
nd XP.
+
+[http://www.textmining.org/ Simple Text Extractor Library] relies on POI.
+
+
+=3D=3D=3D=3D How can I index MS-Excel documents? =3D=3D=3D=3D
+
+In order to index Excel documents you need to first parse them to extract =
text that you want to index from them.  Here are some Excel parsers that ca=
n help you with that:
+
+[http://jakarta.apache.org/poi/ Jakarta Apache POI] has an excellent Micro=
soft Excel parser for versions of Excel from Office 97, 2000, and XP.  You =
can also modify Excel files with this tool.
+
+
+=3D=3D=3D=3D How can I index MS-Powerpoint documents? =3D=3D=3D=3D
+
+In order to index Powerpoint documents you need to first parse them to ext=
ract text that you want to index from them.  You can use the [http://jakart=
a.apache.org/poi/ Jakarta Apache POI], as it contains a parser for Powerpoi=
nt documents.
+
+
+=3D=3D=3D=3D How can I index RTF documents? =3D=3D=3D=3D
+
+In order to index RTF documents you need to first parse them to extract te=
xt that you want to index from them.  Here are some RTF parsers that can he=
lp you with that:
+
+[http://www.tetrasix.com/ MajiX] is a translation utility that will turn R=
TF (Rich Text Format) files into XML files. These XML files could be indexe=
d like any other XML file, or you could write some custom code. (??? doesn'=
t seem to exist anymore -- mention Java's Swing widget instead that can be =
used to access RTF)
+
+
+=3D=3D=3D=3D How can I index PDF documents? =3D=3D=3D=3D
+
+In order to index PDF documents you need to first parse them to extract te=
xt that you want to index from them.  Here are some PDF parsers that can he=
lp you with that:
+
+[http://pdfbox.org/ PDFBox] is a Java API from Ben Litchfield that will le=
t you access the contents of a PDF document. It comes with integration clas=
ses for Lucene to translate a PDF into a Lucene document.
+
+[http://www.foolabs.com/xpdf/ XPDF]  is an open source tool that is licens=
ed under the GPL. It's not a Java tool, but there is a utility called pdfto=
text that can translate PDF files into text files on most platforms from th=
e command line.
+
+Based on xpdf, there is a utility called [http://pdftohtml.sourceforge.net=
/ pdftohtml] that can translate PDF files into HTML files. This is also not=
 a Java application.
+
+[http://www.jpedal.org/ JPedal] is a Java API for extracting text and imag=
es from PDF documents.
+
+
+=3D=3D=3D=3D How can I index JSP files? =3D=3D=3D=3D
+
+To index the content of JSPs that a user would see using a Web browser, yo=
u would need to write an application that acts as a Web client, in order to=
 mimic the Web browser behaviour (i.e. a web crawler).  Once you have such =
an application, you should be able to point it to the desired JSP, retrieve=
 the contents that the JSP generates, parse it, and feed it to Lucene. See =
[http://java-source.net/open-source/crawlers list of Open Source Crawlers i=
n Java].
+
+How to parse the output of the JSP depends on the type of content that the=
 JSP generates.  In most cases the content is going to be in HTML format.
+
+Most importantly, do not try to index JSPs by treating them as normal file=
s in your file system.  In order to index JSPs properly you need to access =
them via HTTP, acting like a Web client.
+
+
+=3D=3D=3D=3D If I use a compound file-style index, do I still need to opti=
mize my index? =3D=3D=3D=3D
+
+Yes.  Each .cfs file created in the compound file-style index represents a=
 single segment, which means you can still merge multiple segments into a s=
ingle segment by optimizing the index.
+
+
+=3D=3D=3D=3D What is the difference between IndexWriter.addIndexes(IndexRe=
ader[]) and IndexWriter.addIndexes(Directory[]), besides them taking differ=
ent arguments? =3D=3D=3D=3D
+
+When merging lots of indexes (more than the mergeFactor), the Directory-ba=
sed method will use fewer file handles and less memory, as it will only eve=
r open mergeFactor indexes at once, while the IndexReader-based method requ=
ires that all indexes be open when passed.
+
+The primary advantage of the IndexReader-based method is that one can pass=
 it IndexReaders that don't reside in a Directory.
+
+
+=3D=3D=3D=3D Can I use Lucene to index text in Chinese, Japanese, Korean, =
and other multi-byte character sets? =3D=3D=3D=3D
+
+Yes, you can.  Lucene is not limited to English, nor any other language.  =
To index text properly, you need to use an Analyzer appropriate for the lan=
guage of the text you are indexing.  Lucene's default Analyzers work well f=
or English.  There are a number of other Analyzers in [http://jakarta.apach=
e.org/lucene/docs/lucene-sandbox/ Lucene Sandbox], including those for Chin=
ese, Japanese, and Korean.
 =

- ||Lucene FAQ at JGuru||[http://www.jguru.com/faq/Lucene]||
- ||Original Lucene FAQs, no longer maintained||[http://lucene.sourceforge.=
net/cgi-bin/faq/faqmanager.cgi]||

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org