lucene-dev mailing list archives

From "Neil Couture" <ncout...@convera.com>
Subject RE: language identifier contrib
Date Mon, 13 Jan 2003 14:11:17 GMT
The Snowball stemmer is not of very good quality. I think the best approach would be to build
a lemmatizer from ispell, or more precisely from the ispell rules syntax. As for the language
identifier, the best overall one is based on Ted Dunning's work. You can find the source code
on the web.

It's C code, but it can easily be ported to Java. Also of interest is the Mozilla source code,
which includes code that does encoding detection. In fact, I developed a Java library starting
from that source code. It's under the LGPL license. Would you be interested in merging that
source code into Lucene?
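
To make the idea of a statistical language identifier concrete, here is a rough sketch of a character-trigram guesser in Java. It is not Ted Dunning's code and not a port of it. It is a simplified illustration that scores text against per-language trigram counts with add-one smoothing. The class name NgramLanguageGuesser, its methods, and the tiny training strings are all hypothetical, and a real identifier would need far more training text per language.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: guesses a language from character trigram counts. */
public class NgramLanguageGuesser {
    private static final int N = 3; // trigram length (an assumption, not from the thread)

    // language -> (trigram -> count seen in training text)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total number of trigrams seen in training text
    private final Map<String, Integer> totals = new HashMap<>();

    /** Adds the trigrams of the given text to the profile of the given language. */
    public void train(String language, String text) {
        Map<String, Integer> profile = counts.computeIfAbsent(language, k -> new HashMap<>());
        String s = normalize(text);
        int added = 0;
        for (int i = 0; i + N <= s.length(); i++) {
            profile.merge(s.substring(i, i + N), 1, Integer::sum);
            added++;
        }
        totals.merge(language, added, Integer::sum);
    }

    /** Returns the trained language with the highest smoothed log-likelihood, or null if untrained. */
    public String guess(String text) {
        String s = normalize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> profile = e.getValue();
            double denom = totals.get(e.getKey()) + profile.size() + 1.0;
            double score = 0.0;
            for (int i = 0; i + N <= s.length(); i++) {
                int c = profile.getOrDefault(s.substring(i, i + N), 0);
                // Add-one smoothing so unseen trigrams do not give log(0).
                score += Math.log((c + 1.0) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Lowercases, collapses whitespace, and pads with one space on each side. */
    private static String normalize(String text) {
        return " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
    }

    public static void main(String[] args) {
        NgramLanguageGuesser g = new NgramLanguageGuesser();
        g.train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        g.train("fr", "le renard brun saute par dessus le chien paresseux et le chat est assis sur le tapis");
        System.out.println(g.guess("the dog and the cat"));
        System.out.println(g.guess("le chien et le chat"));
    }
}
```

Encoding detection is a separate problem and is not covered by this sketch. It assumes the input has already been decoded to a Java String.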


-Neil



-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: January 7, 2003 12:06
To: lucene-dev@jakarta.apache.org
Subject: language identifier contrib


Now that Doug has put Snowball's stemmers in the Lucene Sandbox, it would be
nice to have that language recognition contribution that somebody
mentioned a month or two ago.

Ah, here it is, the original email that mentions this language
identifier:
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgNo=2695

There's also this:
http://frank.spieleck.de/ngram/

Thanks,
Otis



--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

