lucene-java-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <>
Subject [Lucene-java Wiki] Update of "LuceneCaveats" by RenaudWaldura
Date Fri, 29 Jun 2007 21:25:42 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by RenaudWaldura:

New page:
A list of classic Lucene mistakes. While these solutions are all documented in one place or
another, they won't suffer from additional repetition.


===Luke is your friend===

Luke is an invaluable tool to learn what actually went in your index.
If it's not in the index, you can't query it. 

===Use the same analyzer for indexing and querying===

Make sure you use the same Analyzer class when building your index
and later querying against that index. Analysis dictates what goes
into your index and how. The QueryParser needs this information 
to generate the proper queries. 

Dissimilar or incompatible analyzers lead to mysterious search
behavior. See:

===Large documents are truncated by default===

The indexer will be default truncate documents to IndexWriter.DEFAULT_MAX_FIELD_LENGTH 
or 10,000 terms in Lucene 2.0. This limit 
can easily be changed with IndexWriter.setMaxFieldLength().

===Stopwords are removed===

StandardAnalyzer (the most commonly recommended analyzer) does not index
"stopwords". Stopwords are common English words such "the", "a", etc. -- the default list
StopAnalyzer.ENGLISH_STOP_WORDS. These words are completely ignored and 
cannot be searched for, at all. This means that even phrase queries aren't 
exact. E.g. the phrase query "to be or not to be" finds nothing at all.


===Keep the index open===

Especially in a Web application, where keeping state between requests 
requires additional work, it is tempting to open and close the index 
on every request. Unfortunately, this leads to very poor performance.
At first this might work with small indexes or beefy hardware, but you
will soon run into performance problems -- e.g. large garbage collections.

You should keep the index open as long as possible. Both IndexReader
and IndexSearcher are thread-safe and don't require additional
synchronization. One could cache the index searcher e.g. in the application 

See also:

===There is rarely a need to cache search results===

Lucene is amazingly fast at retrieving hits. Rather than caching hits
and paging through them, merely re-executing the query is almost always 
fast enough. 


When doing sub-searches (searching the results of a previous search),
it's easier (and fast enough) to add the first query as a conditional
to the second. 

But if you really need to do this, see:

===Use RangeFilter instead of RangeQuery===

RangeQuery expands every term in the range to a boolean expression,
and easily blows past the built-in BooleanQuery.maxClauseCount limit
(Lucene 2.0 defaults to about 1000).

RangeFilter doesn't suffer from this limitation.



===Edit the query rather than the string===

Parsing free text is a surprisingly hard problem at which QueryParser does a
pretty good job. Rather than editing the query string, change the Query 
objects returned by the QueryParser.

E.g. when offering additional search options in a Web form, it's easier and safer
to combine them with the parsed Query object rather doing text manipulations on 
the query string.

See also:

===Lucene is not a true boolean system===

Or: "apple AND banana OR orange" doesn't work. QueryParser does its best to 
translate from a boolean syntax to Lucene's own set-oriented queries, but
it falls short. Either use parens everywhere or try to design your user
interface accordingly.


===Iterating over all hits takes a long time===

This is by design. Try using a HitsCollector instead if you need access to
all the hits for a search.

===Highlighting search results===

The Lucene-contributed highlighter does the best it can, but it doesn't have 
the information it really needs to correctly highlight the search results.


View raw message