lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <>
Subject Re: snowball (english) and filenames
Date Fri, 18 May 2007 06:46:48 GMT
> a.b.c.d.e.f.g.h is not broken apart like how the snowball demo
> indicates it should do.

I am not sure about the "should" here - the way I see it, this
is just how the demo works: Snowball stemmers operate on words,
so the demo first breaks the input text into words and only
then applies stemming.

> For my lucene testing, I indexed one text file with one
> "a.b.c.d.e.f.g.h" string in it and opened the index up using Luke.
> It only indexed the string a.b.c.d.e.f.g.h (and didn't parse the
> string based on the periods).

In Lucene the way text is "broken" into words is up to
application - and depends on the analyzer being used.
WhitespaceAnalyzer would break on white space. StandardAnalyzer
would do more sophisticated work. Analyzers are extendable,
so you could modify their behavior. The wiki page
"AnalysisParalysis" has some relevant info.

Using Lucene's SimpleAnalyzer btw would break "a.b.c" into
"a b c" which seems to be what you are looking for?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message