lucene-java-user mailing list archives

From AHMET ARSLAN <iori...@yahoo.com>
Subject Re: Is there a list of "special" characters for standard analyzer?
Date Fri, 31 Jul 2009 07:26:53 GMT

> I guess that the obvious question is "Which characters are
> considered 'punctuation characters'?".
 
Punctuation = ("_"|"-"|"/"|"."|",")

> In particular, does the analyzer consider "=" (equal) and
> ":" (colon) to be punctuation characters?

":" is a special character for QueryParser (if you are using it). To search for it, you
need to escape it first. At index time this character is ignored, like the punctuation
characters: the string ahmet:arslan produces two tokens, ahmet and arslan. The tokenizer also
breaks words at the "=" character, at both query and index time.
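
To make the two effects above concrete, here is a small stand-alone sketch. It does not use
Lucene at all: escapeQuerySyntax is a hand-rolled stand-in for the static QueryParser.escape
method that Lucene ships (prefer the real one in practice), and roughTokens only imitates the
"break at anything that is not a letter or digit" behavior described above. The class name,
the method names, and the exact list of escaped characters are assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy stand-ins for illustration only; this is not Lucene code. */
final class ColonAndEqualsSketch {

    // Assumption: single characters that the classic query syntax treats as special.
    private static final String QUERY_SYNTAX_CHARS = "\\+-!():^[]\"{}~*?";

    // Assumption: a "word" is a run of letters and digits; anything else separates tokens.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{Nd}]+");

    /** Puts a backslash in front of every query-syntax character. */
    static String escapeQuerySyntax(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (QUERY_SYNTAX_CHARS.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Collects runs of letters and digits, dropping ":" and "=" along the way. */
    static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(escapeQuerySyntax("ahmet:arslan")); // ahmet\:arslan
        System.out.println(roughTokens("ahmet:arslan"));       // [ahmet, arslan]
        System.out.println(roughTokens("key=value"));          // [key, value]
    }
}
```

Note that the real StandardTokenizer has more rules than this toy (see the token types listed
below), so it will not always split where roughTokens does.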

If you want to understand the behavior of StandardTokenizer, look at the file
StandardTokenizerImpl.jflex. It recognizes each of the following as a single token: {ALPHANUM},
{APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL}, {HOST}, {NUM}, {CJ} and {ACRONYM_DEP}, and ignores
everything else. The file defines each of these token types in a notation similar to regular
expressions. You can change the behavior of StandardTokenizer by editing this file and
regenerating StandardTokenizerImpl.java from it. There is also another jflex file named
WikipediaTokenizerImpl.jflex; by looking at it you can see how new token types can be added.
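
To give a feel for what "similar to regular expressions" means, the sketch below expresses a
few of the token types as java.util.regex patterns and reports which one matches a whole
string. Only the type names come from the list above. The patterns themselves are my rough
approximations and are not copied from StandardTokenizerImpl.jflex, and the class and method
names are made up for this example. Unlike JFlex, which picks the longest match, this sketch
simply tries the patterns in a fixed order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Approximate, regex-based imitation of a few token types; not the real grammar. */
final class TokenTypeSketch {

    // Assumption: the basic building block is a run of letters and digits.
    private static final String ALNUM = "[\\p{L}\\p{Nd}]+";

    // Insertion order matters: more specific types are tried first.
    private static final Map<String, Pattern> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("ACRONYM", Pattern.compile("\\p{L}\\.(\\p{L}\\.)+"));
        TYPES.put("COMPANY", Pattern.compile(ALNUM + "[&@]" + ALNUM));
        TYPES.put("EMAIL", Pattern.compile(
                ALNUM + "([._-]" + ALNUM + ")*@" + ALNUM + "([.-]" + ALNUM + ")+"));
        TYPES.put("HOST", Pattern.compile(ALNUM + "(\\." + ALNUM + ")+"));
        TYPES.put("ALPHANUM", Pattern.compile(ALNUM));
    }

    /** Returns the first type whose pattern matches the entire candidate string. */
    static String classify(String candidate) {
        for (Map.Entry<String, Pattern> entry : TYPES.entrySet()) {
            if (entry.getValue().matcher(candidate).matches()) {
                return entry.getKey();
            }
        }
        return "UNMATCHED";
    }

    public static void main(String[] args) {
        String[] samples = {"ahmet", "U.S.A.", "AT&T", "user@example.org",
                            "lucene.apache.org", "ahmet:arslan"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

In this sketch "ahmet:arslan" comes back as UNMATCHED because no single pattern covers the
colon, which mirrors why that string ends up as two separate tokens.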

Ahmet


      
