lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: StrictAnalyzer Proposal
Date Wed, 20 Feb 2002 17:08:10 GMT
Hello,
 
> (1) I rewrote StandardAnalyzer as StrictAnalyzer for the project I am
> working on.  StandardAnalyzer does not filter enough words for my liking.
> Basically all I did was add to the STOP_WORDS array.  The stop words I
> added are based on the default values in SQL Server 2000's text indexing.
> (Source code below)

The change seems simple and looks fine to me.  If nobody objects by
tonight, I'll commit it.
In the future, I'd recommend using explicit imports (not import ....*;).

> (2) I would also like to propose a change to StandardTokenizer which
> supports strings with a trailing and/or leading comma(s) such as
> "therefore," and ",ice,".  Currently StandardTokenizer is not returning
> any results for some of my most basic searches because of commas
> adjacent to words.
> 
> Comments, suggestions, questions?

Hm, shouldn't those commas be filtered out by one of the analyzers at
both indexing and searching time?  Are you using StopAnalyzer?
Please also see http://www.jguru.com/faq/view.jsp?EID=538308
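
As a rough illustration of why the same analysis should run at both
indexing and searching time, here is a minimal sketch in plain Java.  It
does not use any Lucene classes: the class name SameAnalysisSketch, the
analyze() helper, and the six-word stop list are all hypothetical stand-ins,
and the comma handling shown is an assumption about the desired behavior,
not a description of what StandardTokenizer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SameAnalysisSketch {
    // Hypothetical stop list: a small subset of the STOP_WORDS quoted below.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    /**
     * Stand-in for an analyzer (not Lucene code): splits on whitespace,
     * trims non-alphanumeric characters such as commas from both ends of
     * each token, lowercases it, and drops stop words.
     */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String term = raw
                .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                .toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The same analyze() call is applied to the document and to the query.
        List<String> indexed = analyze("Therefore, the ,ice, melted");
        List<String> query = analyze(",ice,");
        System.out.println(indexed);                    // [therefore, ice, melted]
        System.out.println(query);                      // [ice]
        System.out.println(indexed.containsAll(query)); // true
    }
}
```

Because both sides go through the same function, the query ",ice," reduces
to the same term that was stored for the document, so it matches.  If only
one side stripped the commas, the terms would differ and the search would
come back empty.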

Otis

> import org.apache.lucene.analysis.*;
> import java.io.Reader;
> import java.util.Hashtable;
> 
> /** Filters {@link StandardTokenizer} with {@link StandardFilter},
>  * {@link LowerCaseFilter} and {@link StopFilter}. */
> public final class StrictAnalyzer extends Analyzer {
>   private Hashtable stopTable;
> 
>   /** An array containing some common English words that are not usually
>    * useful for searching. */
>   public static final String[] STOP_WORDS = {
>     "0","1","2","3","4","5","6","7","8","9", 
>     "$", 
>     "about",  "after",  "all", "also",  "an",  "and", 
>     "another", "any", "are", "as", "at", "be", "because",
>     "been", "before", "being", "between", "both", "but",
>     "by","came","can","come","could","did","do","does",
>     "each","else","for","from","get","got","has","had",
>     "he","have","her","here","him","himself","his","how",
>     "if","in","into","is","it","its","just","like","make",
>     "many","me","might","more","most","much","must","my",
>     "never","now","of","on","only","or","other","our","out",
>     "over","re","said","same","see","should","since","so",
>     "some","still","such","take","than","that","the","their",
>     "them","then","there","these","they","this","those","through",
>     "to","too","under","up","use","very","want","was","way","we",
>     "well","were","what","when","where","which","while","who","will",
>     "with","would","you","your",
>     "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s",
>     "t","u","v","w","x","y","z"
>   };
> 
>   /** Builds an analyzer. */
>   public StrictAnalyzer() {
>     this(STOP_WORDS);
>   }
> 
>   /** Builds an analyzer with the given stop words. */
>   public StrictAnalyzer(String[] stopWords) {
>     stopTable = StopFilter.makeStopTable(stopWords);
>   }
> 
>   /** Constructs a {@link StandardTokenizer} filtered by a {@link
>    * StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
>   public final TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new StandardTokenizer(reader);
>     result = new StandardFilter(result);
>     result = new LowerCaseFilter(result);
>     result = new StopFilter(result, stopTable);
>     return result;
>   }
> }

