lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: default AND operator
Date Sun, 17 Sep 2006 22:02:49 GMT
You probably want to take a closer look at the StandardAnalyzer. It uses
StandardTokenizer and StandardFilter. From the javadoc:

<<<<<StandardTokenizer

A grammar-based tokenizer constructed with JavaCC.

This should be a good tokenizer for most European-language documents:


   - Splits words at punctuation characters, removing punctuation.
   However, a dot that's not followed by whitespace is considered part of a
   token.
   - Splits words at hyphens, unless there's a number in the token, in
   which case the whole token is interpreted as a product number and is not
   split.
   - Recognizes email addresses and internet hostnames as one token.


Many applications have specific tokenizer needs. If this tokenizer does not
suit your application, please consider copying this source code directory to
your project and maintaining your own grammar-based tokenizer.
>>>>

When I first started with Lucene, I was surprised that StandardAnalyzer did
the tricks it does. I quickly found that, especially when starting out, I
got more intuitive results by using one of the simpler analyzers,
WhitespaceAnalyzer, StopAnalyzer or SimpleAnalyzer.
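
To make the difference concrete, here is a rough sketch in plain Java (no
Lucene classes) of the kind of token stream each of the simpler analyzers
produces. It only approximates their behaviour: the class name, the method
names and the tiny stop list are assumptions made up for illustration, not
Lucene's API or its default stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SimpleAnalyzerSketch {
    // Illustrative stop list only; NOT Lucene's actual default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of"));

    // Roughly WhitespaceAnalyzer: split on whitespace, change nothing else.
    static List<String> whitespaceStyle(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Roughly SimpleAnalyzer: split on anything that is not a letter, then lowercase.
    static List<String> simpleStyle(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) tokens.add(piece.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Roughly StopAnalyzer: the simple-style tokens with stop words dropped.
    static List<String> stopStyle(String text) {
        List<String> tokens = simpleStyle(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Scott's Lawn and Garden Care";
        System.out.println(whitespaceStyle(text)); // [Scott's, Lawn, and, Garden, Care]
        System.out.println(simpleStyle(text));     // [scott, s, lawn, and, garden, care]
        System.out.println(stopStyle(text));       // [scott, s, lawn, garden, care]
    }
}
```

Printing the tokens like this for a handful of sample strings is a quick way
to see which style matches what you expect before you commit to an analyzer.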

And one of the coolest analyzers is PatternAnalyzer, down in
org.apache.lucene.index.memory.PatternAnalyzer

which uses a regular expression to tokenize streams. But do note if you use
this that the regex describes what to *break* on (the delimiters), not what
constitutes a token....
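
The break-on versus match-on distinction is easy to get backwards, so here
is a small sketch of it using only java.util.regex. This is not
PatternAnalyzer itself; the class and method names are hypothetical and
exist only to show the two ways of reading the same pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitVersusMatch {
    // Pattern as delimiter: tokens are whatever lies between the matches.
    static List<String> tokensBySplitting(String text, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : Pattern.compile(delimiterRegex).split(text)) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Pattern as token: tokens are the matches themselves.
    static List<String> tokensByMatching(String text, String tokenRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Scott's Lawn & Garden Care";
        // Breaking on runs of non-word characters gives the words.
        System.out.println(tokensBySplitting(text, "\\W+")); // [Scott, s, Lawn, Garden, Care]
        // Matching the same pattern gives only the punctuation and spaces.
        System.out.println(tokensByMatching(text, "\\W+"));
    }
}
```

If the tokens you get back look like punctuation and whitespace, the pattern
is most likely being read the opposite way from what you intended.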

Best
Erick

On 9/17/06, no spam <mrs.nospam@gmail.com> wrote:
>
> That question was badly worded.  I was trying to ask that when I write an
> index using the StandardAnalyzer, the docs are transformed using that
> analyzer then written to the index post transformation. So stop words or
> things like apostrophes would be removed.
>
> "Scott's Lawn and Garden Care"     becomes    "Scott Lawn Garden Care"
>
> It just seems that my index written using the StandardAnalyzer still has
> things like apostrophes and also things like the & symbol.
>
> On 9/17/06, Chris Hostetter <hossman_lucene@fucit.org> wrote:
> >
> >
> > what do you mean "written to the index per field" .. analyzers aren't
> > written to the index at all, the analyzer used is completely forgotten
> > once your index is built.  if you want separate analyzers per field, take
> > a look at the PerFieldAnalyzerWrapper (i think that's the name) ... as for
> > why Stemmed Queries might match on terms indexed using StandardAnalyzer
> > ... who knows ... it depends on how exactly they are getting stemmed, and
> > what other types of data might have made it into your index (maybe your
> > source data had the words you are searching on spelled incorrectly as
> > well, and it just happens to match the stemmed versions).
> >
> > When you have questions like this, searcher.explain is your friend.
> >
> >
> >
> > -Hoss
> >
> >
> >
> >
>
>
