lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 28960] - Add "an" to the English stop words
Date Thu, 20 May 2004 16:52:35 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=28960>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=28960

Add "an" to the English stop words

cutting@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |



------- Additional Comments From cutting@apache.org  2004-05-20 16:52 -------
This is a can of worms I'm hesitant to open.  If we add "an" then we'll be asked
to add "its", and if we add "its" we'll be asked to add "do", and so on.  This
stop list was originally generated by looking at the most frequent terms in a
collection.  I guess "an" was less frequent than "a" or any other word in that
collection.  There are other, better, ways to define stop lists, but I don't
think the Lucene project should be in the business of providing high-quality stop
lists.  The Snowball project is a much better place for that sort of activity.

If you want a good, big, English stop list, grab:

  http://snowball.tartarus.org/english/stop.txt

I think the best long-term fix for this is to extend the Snowball library in the
sandbox (http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/) so that
it provides StopFilters for each of the stop lists provided by Snowball.  Once
we do this, we can deprecate uses of StopFilter and StopAnalyzer that do not
specify a custom stop list.  The deprecation documentation can point folks to
the Snowball stop filters.  How does that sound?

Any volunteers to implement Snowball-based StopFilters?  I think this could just
be a static method, something like:
  public static StopFilter getStopFilter(String language);
The implementation could use ClassLoader.getResource() to find a stop list file
packaged in the jar file, then parse the file and construct a StopFilter from
it.  It should probably also cache these, so that every call doesn't re-parse
the file.
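
A rough sketch of the lookup-parse-cache part of that idea, in modern Java
and without any Lucene dependency. Everything named here is an assumption
for illustration: the class SnowballStopWords, the resource path
"stopwords/<language>.txt", and the method names are hypothetical, and it
uses ClassLoader.getResourceAsStream() rather than getResource(). It
assumes the Snowball stop.txt layout, where "|" starts a comment and stop
words sit at the start of a line. It returns a plain set of words; turning
that set into a StopFilter is left out because the constructor depends on
the Lucene version in use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: loads Snowball-style stop lists packaged in the jar. */
public final class SnowballStopWords {
    // One parsed stop list per language, so repeated calls do not re-parse the file.
    private static final Map<String, Set<String>> CACHE = new ConcurrentHashMap<>();

    private SnowballStopWords() {}

    /** Returns the cached stop words for a language, loading them on first use. */
    public static Set<String> getStopWords(String language) {
        return CACHE.computeIfAbsent(language, SnowballStopWords::load);
    }

    private static Set<String> load(String language) {
        // Assumed resource layout; the real packaging could differ.
        String resource = "stopwords/" + language + ".txt";
        InputStream in = SnowballStopWords.class.getClassLoader().getResourceAsStream(resource);
        if (in == null) {
            throw new IllegalArgumentException("No stop list for language: " + language);
        }
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses Snowball-style text: '|' starts a comment, words are whitespace-separated. */
    public static Set<String> parse(Reader reader) throws IOException {
        Set<String> words = new HashSet<>();
        BufferedReader lines = new BufferedReader(reader);
        String line;
        while ((line = lines.readLine()) != null) {
            int bar = line.indexOf('|');
            if (bar >= 0) {
                line = line.substring(0, bar); // drop the trailing comment
            }
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return Collections.unmodifiableSet(words);
    }
}
```

A caller would then do something like SnowballStopWords.getStopWords("english")
and hand the resulting set to whatever StopFilter constructor is available.
A language with no packaged stop list fails with an IllegalArgumentException
and nothing is cached for it.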
