lucene-dev mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: svn commit: r428998 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/analysis/StopAnalyzer.java src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java
Date Sat, 05 Aug 2006 20:31:16 GMT
Stop words and stemming always make literal searching less precise.
The general benefit is greater matching power (queries match more
broadly) and a smaller index.

Where did the English stop word list come from?  I feel as if I don't
have enough info to judge if this is a good change or not.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 8/5/06, dnaber@apache.org <dnaber@apache.org> wrote:
> Author: dnaber
> Date: Sat Aug  5 06:11:09 2006
> New Revision: 428998
>
> URL: http://svn.apache.org/viewvc?rev=428998&view=rev
> Log:
> remove "s" and "t" as stopwords because they make searching less precise, e.g. "t-online" gives the same results as "online" with "t" being a stopword
>
> Modified:
>     lucene/java/trunk/CHANGES.txt
>     lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java
>     lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java
>
> Modified: lucene/java/trunk/CHANGES.txt
> URL: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=428998&r1=428997&r2=428998&view=diff
> ==============================================================================
> --- lucene/java/trunk/CHANGES.txt (original)
> +++ lucene/java/trunk/CHANGES.txt Sat Aug  5 06:11:09 2006
> @@ -4,6 +4,15 @@
>
>  Trunk (not yet released)
>
> +Changes in runtime behavior
> +
> + 1. 's' and 't' have been removed from the list of default stopwords
> +    in StopAnalyzer (also used by StandardAnalyzer). Having e.g. 's'
> +    as a stopword meant that 's-class' led to the same results as 'class'.
> +    Note that this problem still exists for 'a', e.g. in 'a-class' as
> +    'a' continues to be a stopword.
> +    (Daniel Naber)
> +
>  New features
>
>   1. LUCENE-503: New ThaiAnalyzer and ThaiWordFilter in contrib/analyzers
>
> Modified: lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java
> URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java?rev=428998&r1=428997&r2=428998&view=diff
> ==============================================================================
> --- lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java (original)
> +++ lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java Sat Aug  5 06:11:09 2006
> @@ -31,8 +31,8 @@
>    public static final String[] ENGLISH_STOP_WORDS = {
>      "a", "an", "and", "are", "as", "at", "be", "but", "by",
>      "for", "if", "in", "into", "is", "it",
> -    "no", "not", "of", "on", "or", "s", "such",
> -    "t", "that", "the", "their", "then", "there", "these",
> +    "no", "not", "of", "on", "or", "such",
> +    "that", "the", "their", "then", "there", "these",
>      "they", "this", "to", "was", "will", "with"
>    };
>
>
> Modified: lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java
> URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java?rev=428998&r1=428997&r2=428998&view=diff
> ==============================================================================
> --- lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java (original)
> +++ lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java Sat Aug  5 06:11:09 2006
> @@ -55,7 +55,17 @@
>      // possessives are actually removed by StandardFilter, not the tokenizer
>      assertAnalyzesTo(a, "O'Reilly", new String[]{"o'reilly"});
>      assertAnalyzesTo(a, "you're", new String[]{"you're"});
> +    assertAnalyzesTo(a, "she's", new String[]{"she"});
> +    assertAnalyzesTo(a, "Jim's", new String[]{"jim"});
> +    assertAnalyzesTo(a, "don't", new String[]{"don't"});
>      assertAnalyzesTo(a, "O'Reilly's", new String[]{"o'reilly"});
> +
> +    // t and s had been stopwords in Lucene <= 2.0, which made it impossible
> +    // to correctly search for these terms:
> +    assertAnalyzesTo(a, "s-class", new String[]{"s", "class"});
> +    assertAnalyzesTo(a, "t-com", new String[]{"t", "com"});
> +    // 'a' is still a stopword:
> +    assertAnalyzesTo(a, "a-class", new String[]{"class"});
>
>      // company names
>      assertAnalyzesTo(a, "AT&T", new String[]{"at&t"});
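
The effect discussed in this thread can be sketched without Lucene. The example below is only an illustration under stated assumptions: `StopWordSketch` and `analyze` are hypothetical names, and the tokenizer (lowercase, split on non-alphanumeric characters) is a deliberate simplification that does not reproduce StandardAnalyzer's handling of apostrophes, possessives, or names such as "AT&T". Only the two stop word lists are taken from the quoted diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Stop list as it stood before r428998 (copied from the quoted diff).
    static final Set<String> OLD_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "t", "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Stop list after r428998: the same list without "s" and "t".
    static final Set<String> NEW_STOP_WORDS = new HashSet<>(OLD_STOP_WORDS);
    static {
        NEW_STOP_WORDS.remove("s");
        NEW_STOP_WORDS.remove("t");
    }

    // Simplified stand-in for an analyzer (an assumption, not Lucene code):
    // lowercase the text, split on non-alphanumerics, drop stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("t-online", OLD_STOP_WORDS)); // [online]
        System.out.println(analyze("t-online", NEW_STOP_WORDS)); // [t, online]
        System.out.println(analyze("a-class", NEW_STOP_WORDS));  // [class]
    }
}
```

With the old list, "t-online" and "online" reduce to the same single token, so a literal search cannot tell them apart. With the new list the "t" token survives, at the cost of indexing more single-letter tokens. The "a-class" case shows the same collapse still happening for any letter that remains a stop word.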
