lucene-java-user mailing list archives

From <oh...@cox.net>
Subject Re: Is there a list of "special" characters for standard analyzer?
Date Thu, 01 Jan 1970 00:00:00 GMT

---- Phil Whelan <phil123@gmail.com> wrote: 
> On Thu, Jul 30, 2009 at 7:12 PM, <ohaya@cox.net> wrote:
> > I was wondering if there is a list of special characters for the standard analyzer?
> >
> > What I mean by "special" is characters that the analyzer considers break characters.
> > For example, if I have something like "foo=something", apparently the analyzer
> > considers this as two terms, "foo" and "something".
> 
> Hi Jim,
> 
> This is what I could find in the docs...
> 
> StandardAnalyzer uses StandardTokenizer
> 
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
> * Splits words at punctuation characters, removing punctuation.
> However, a dot that's not followed by whitespace is considered part of
> a token.
> * Splits words at hyphens, unless there's a number in the token, in
> which case the whole token is interpreted as a product number and is
> not split.
> * Recognizes email addresses and internet hostnames as one token.
> 
> Also, these are the tokens that will be removed:
> 
>   public static final String[] ENGLISH_STOP_WORDS = {
>     "a", "an", "and", "are", "as", "at", "be", "but", "by",
>     "for", "if", "in", "into", "is", "it",
>     "no", "not", "of", "on", "or", "such",
>     "that", "the", "their", "then", "there", "these",
>     "they", "this", "to", "was", "will", "with"
>   };
> 
> Thanks,
> Phil
> 
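
To make the quoted rules concrete, here is a rough sketch in plain Java that does not use Lucene at all. It assumes a simplified model in which every character that is not a letter or digit is a break character, lowercases each token, and then drops the stop words listed above. The class name TokenizeSketch and the tokenize helper are hypothetical, and the sketch deliberately ignores the special cases described in the docs (dots not followed by whitespace, hyphenated tokens containing a number, email addresses, hostnames), so it should not be read as StandardTokenizer's actual grammar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical, simplified model of "split on punctuation, then drop stop words".
// This is NOT Lucene code and does not reproduce StandardTokenizer's real rules.
class TokenizeSketch {

    // Stop words copied from the ENGLISH_STOP_WORDS list quoted above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Assumption: any run of characters that are not letters or digits
    // separates two tokens. Tokens are lowercased before the stop-word check.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            String token = piece.toLowerCase(Locale.ROOT);
            // split() can yield an empty first element when the text
            // starts with a separator, so skip empty pieces.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo=something"));
        System.out.println(tokenize("The key:value is set"));
    }
}
```

Under this simplified model both "=" and ":" act as break characters, which matches the "foo=something" behavior reported at the top of the thread. Whether the real analyzer treats a given character the same way is best confirmed by running the text through StandardAnalyzer itself and printing the resulting tokens.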


Hi Phil,

I guess that the obvious question is "Which characters are considered 'punctuation characters'?".

In particular, does the analyzer consider "=" (equal) and ":" (colon) to be punctuation characters?

Thanks,
Jim


