lucene-java-user mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: Problems when changing stoplist file
Date Thu, 11 Sep 2008 16:00:04 GMT
Hi Marie,

On 09/11/2008 at 4:03 AM, Marie-Christine Plogmann wrote:
> I am currently using the demo class IndexFiles to index some
> corpus. I have replaced the Standard by a GermanAnalyzer.
> Here, indexing works fine.
> But if I specify a different stopword list that should be
> used, the tokenization doesn't seem to work properly. Mostly
> some letters are missing at the end. Has somebody encountered
> a similar problem? What could be the problem?

Are you sure that this only occurs after you change the stopword list?

I assume you're using the GermanAnalyzer in contrib/. It constructs an analysis pipeline consisting
of StandardTokenizer, StandardFilter, LowerCaseFilter, StopFilter, and finally GermanStemFilter.
GermanStemFilter invokes GermanStemmer <http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_3_2/contrib/analyzers/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?view=markup>,
an implementation of the stemming algorithm described in the paper linked from here:
<http://www.inf.fu-berlin.de/inst/pubs/tr-b-99-16.abstract.html>.

A basic question to get out of the way: Are you aware that the stemming operation removes
letters from the end of some words?
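
To show why I'm asking, here is a minimal, self-contained sketch of a pipeline of the same shape
(tokenize, lowercase, drop stop words, stem). It does not use the Lucene API: the class name
StemSketch, the method names, and the suffix list are all hypothetical, and the toy suffix
stripping is not the GermanStemmer algorithm. It only illustrates that a stemming step shortens
word endings whichever stop list is supplied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StemSketch {
    // Hypothetical suffix list, for illustration only; NOT the GermanStemmer rules.
    private static final String[] SUFFIXES = {"ern", "en", "er", "es", "e", "s"};

    // Strip the first matching suffix, keeping at least three letters of the term.
    static String toyStem(String term) {
        for (String suffix : SUFFIXES) {
            if (term.endsWith(suffix) && term.length() - suffix.length() >= 3) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    // Tokenize on non-letters, lowercase, drop stop words, then stem what remains.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            String lower = raw.toLowerCase(Locale.GERMAN);
            if (stopWords.contains(lower)) {
                continue; // the stop list only decides which tokens are dropped
            }
            out.add(toyStem(lower));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Die Kinder spielen in den Feldern";
        Set<String> listA = new HashSet<String>(Arrays.asList("die", "in", "den"));
        Set<String> listB = new HashSet<String>(Arrays.asList("die"));
        System.out.println(analyze(text, listA)); // prints [kind, spiel, feld]
        System.out.println(analyze(text, listB)); // prints [kind, spiel, in, den, feld]
    }
}
```

With either stop list, "kinder", "spielen" and "feldern" lose their endings. If your index shows
the same kind of truncation with the default stop list too, the stemmer is the likely cause
rather than the stop list.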

Steve


