lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Basic question about indexing certain words
Date Sun, 27 Dec 2009 23:12:12 GMT
It depends completely on what analyzer you use. Conceptually, an Analyzer
is composed of a Tokenizer followed by zero or more Filters. The input
stream is broken up by the Tokenizer, then each token is passed through the
Filters in turn (e.g. LowerCaseFilter, StopFilter).

The reason I'm not answering your question directly is that I can't. If you
choose, say, a WhitespaceAnalyzer, which is built from a WhitespaceTokenizer,
then your hyphens and apostrophes will pass through as-is, and your tokens
(the minimal searchable unit) will be "Jane" "Doe-Smith" and "Sa'eed",
capitals and all.

If you choose StandardAnalyzer, built on StandardTokenizer and several
filters, your tokens would be "jane" "doe" "smith" "sa'eed": the hyphen
splits the name, the interior apostrophe is kept, and everything is
lower-cased.
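
To make the Tokenizer-then-Filters idea concrete, here is a minimal plain-Java
sketch. It does not use Lucene at all: every name in it (AnalyzerSketch,
whitespaceTokenize, lettersOnlyTokenize, lowerCase, analyze) is made up for
illustration, and the letters-only tokenizer is deliberately crude and is
not a model of StandardTokenizer. It only shows how the choice of tokenizer
and filters changes the tokens you end up with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Plain-Java illustration only; none of these names are Lucene classes.
public class AnalyzerSketch {

    // "Tokenizer" stand-in: split on runs of whitespace, keep everything else.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Tokenizer" stand-in: split on anything that is not a letter.
    // Deliberately crude (assumption for the demo, not StandardTokenizer).
    static List<String> lettersOnlyTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Filter" stand-in: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // "Analyzer" stand-in: one tokenizer, then the filters in order.
    static List<String> analyze(String text,
                                Function<String, List<String>> tokenizer,
                                List<UnaryOperator<List<String>>> filters) {
        List<String> tokens = tokenizer.apply(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Jane Doe-Smith Sa'eed";
        List<UnaryOperator<List<String>>> noFilters = new ArrayList<>();
        List<UnaryOperator<List<String>>> lowerCasing = new ArrayList<>();
        lowerCasing.add(AnalyzerSketch::lowerCase);

        // Whitespace tokenizer, no filters: punctuation and case survive.
        System.out.println(analyze(text, AnalyzerSketch::whitespaceTokenize, noFilters));
        // Letters-only tokenizer plus lower-casing: names are split at punctuation.
        System.out.println(analyze(text, AnalyzerSketch::lettersOnlyTokenize, lowerCasing));
    }
}
```

The first call prints [Jane, Doe-Smith, Sa'eed] and the second prints
[jane, doe, smith, sa, eed], so the same text becomes two quite different
sets of searchable terms.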

You can build your own Analyzers to process text however you please. Lucene
In Action has quite a thorough explanation of this process; you'll save
yourself a bunch of time by reading those sections. You can get the second
edition of that book in electronic form from Manning through their early
access program.

Until you understand this process well, I'd recommend that you be very, very
sure that you use the *same* analyzer for both indexing and searching or
your results will be...surprising.
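
A small sketch of what "surprising" can look like, again in plain Java
rather than Lucene. The toy inverted index and the two analyzers here
(buildIndex, search, lowerCasing, caseKeeping) are hypothetical names
invented for this example. The point is only that a query term has to be
analyzed into exactly the form that was stored at index time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Plain-Java illustration only; this is not Lucene's index or query code.
public class AnalyzerMismatchSketch {

    // Analyzer stand-in: whitespace split, case preserved.
    static List<String> caseKeeping(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Analyzer stand-in: whitespace split, then lower-cased.
    static List<String> lowerCasing(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : caseKeeping(text)) tokens.add(t.toLowerCase(Locale.ROOT));
        return tokens;
    }

    // Toy inverted index: token -> ids of the documents containing it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs,
                                                Function<String, List<String>> analyzer) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : analyzer.apply(docs.get(id))) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Returns ids of documents that contain every analyzed query token.
    static Set<Integer> search(Map<String, Set<Integer>> index, String query,
                               Function<String, List<String>> analyzer) {
        Set<Integer> hits = null;
        for (String token : analyzer.apply(query)) {
            Set<Integer> postings = index.getOrDefault(token, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(postings);
            } else {
                hits.retainAll(postings);
            }
        }
        if (hits == null) {
            return new TreeSet<>();
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Jane Doe-Smith", "Sa'eed");
        Map<String, Set<Integer>> index =
                buildIndex(docs, AnalyzerMismatchSketch::lowerCasing);

        // Same analyzer on both sides: the query becomes "doe-smith" and matches.
        System.out.println(search(index, "Doe-Smith", AnalyzerMismatchSketch::lowerCasing));
        // Different analyzer at search time: "Doe-Smith" was never stored.
        System.out.println(search(index, "Doe-Smith", AnalyzerMismatchSketch::caseKeeping));
    }
}
```

The first search prints [0]; the second prints [] even though the document
is plainly in the index.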

Think about getting a copy of Luke to examine your indexes; that tool makes
it easy to see the effects of various Analyzers. Google "Lucene Luke".

Finally, you can easily use *different* analyzers for different fields
within a document; see PerFieldAnalyzerWrapper.
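
The per-field idea boils down to a lookup: pick the analyzer registered for
the field, or fall back to a default. The sketch below is a plain-Java
stand-in for that idea and is not the PerFieldAnalyzerWrapper API itself;
the class PerFieldSketch, its methods, and the field names "name" and
"body" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only; not Lucene's PerFieldAnalyzerWrapper.
public class PerFieldSketch {

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a field-specific analyzer that overrides the default.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Analyze text with the field's analyzer, or the default if none is set.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Analyzer stand-in: whitespace split, punctuation and case preserved.
    static List<String> keepWhole(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Analyzer stand-in: split on non-letters, then lower-case.
    static List<String> lowerCaseWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        PerFieldSketch analyzers = new PerFieldSketch(PerFieldSketch::lowerCaseWords);
        analyzers.addAnalyzer("name", PerFieldSketch::keepWhole);

        // The "name" field keeps names intact; any other field uses the default.
        System.out.println(analyzers.analyze("name", "Jane Doe-Smith"));
        System.out.println(analyzers.analyze("body", "Jane Doe-Smith"));
    }
}
```

Here the "name" field yields [Jane, Doe-Smith] while "body" yields
[jane, doe, smith]. Whatever mapping you choose, use the same one at search
time.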

HTH
Erick

On Sun, Dec 27, 2009 at 5:48 PM, syedfa <fayyazuddin@gmail.com> wrote:

>
> Dear fellow Java developers:
>
> I have a very basic question about indexing text using Lucene.  I am
> indexing a large amount of text, that includes names that contain certain
> punctuation (e.g. "Jane Doe-Smith", "Sa'eed", etc.).  Will the punctuation
> throw off the indexer in any way, such that it breaks up the tokens when
> they shouldn't be, or will the indexer simply treat the punctuation inside
> the names as any other character, and the presence of the punctuation will
> not in any way hinder a user's ability to search for that name?  Are there
> any precautions that I should take to avoid any problems?
>
> I hope this question is clear and makes sense.
>
> Thanks in advance to all who reply.
