lucene-java-user mailing list archives

From <oh...@cox.net>
Subject Re: Is there a list of "special" characters for standard analyzer?
Date Fri, 31 Jul 2009 15:00:22 GMT
Hi Ahmet,

Thanks for the clarification and information!  That was exactly what I was looking for.

Jim


---- AHMET ARSLAN <iorixxx@yahoo.com> wrote: 
> 
> > I guess that the obvious question is "Which characters are
> > considered 'punctuation characters'?".
>  
> Punctuation = ("_"|"-"|"/"|"."|",")
> 
> > In particular, does the analyzer consider "=" (equal) and
> > ":" (colon) to be punctuation characters?
> 
> ":" is a special character in QueryParser (if you are using it). If you
> want to search for it, you need to escape it first. At index time this
> character is ignored, like the punctuation characters: the string
> ahmet:arslan will produce two tokens, ahmet and arslan. The tokenizer also
> breaks words at the "=" character, at both query and index time.
> 
> If you want to understand the behavior of StandardTokenizer, look at the
> file StandardTokenizerImpl.jflex. It recognizes each of the following as a
> single token: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL},
> {HOST}, {NUM}, {CJ}, {ACRONYM_DEP}, and ignores the rest. The file defines
> these token types with rules similar to regular expressions. You can change
> the behavior of StandardTokenizer by editing this file and generating
> StandardTokenizerImpl.java from it. There is also another JFlex file named
> WikipediaTokenizerImpl.jflex; by looking at it you can see how new token
> types can be added.
> 
> Ahmet
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
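
The escaping step mentioned in the quoted message can be sketched roughly as follows. This is a minimal, self-contained illustration, not Lucene code: the class name QueryEscapeSketch and its escape method are hypothetical, and the character list is assumed from the classic query syntax (Lucene's QueryParser offers its own static escape(String) method, which is what you would normally call). Note that escaping only keeps the parser from reading "ahmet" as a field name; per the message above, the analyzer still splits the text at the colon.

```java
// Hypothetical helper for illustration only; not part of Lucene.
class QueryEscapeSketch {

    // Characters assumed to be operators in the classic query syntax:
    // \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Returns a copy of s with a backslash placed before each special character.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("ahmet:arslan")); // prints ahmet\:arslan
        System.out.println(escape("a=b"));          // prints a=b ("=" is not a parser operator)
    }
}
```

The "=" case is left unchanged because it has no meaning to the query parser; it is the tokenizer that breaks words there.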

