lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Elie Naulleau" <semio...@wanadoo.fr>
Subject RE: Automatically determin Language of document
Date Fri, 23 Nov 2001 14:13:37 GMT

Hi,
You could try Doug Beeferman's variable-length character n-gram approach
to identify a language among 13 european ones.
http://www.dougb.com/ident.html

I tried it and it works pretty well. It's based on a similarity mesure
(cosine)
between a corpus-model and the input text.

There are issues depending on which character set you used (iso-latin1,
or other asci flavor).

If you just have 4 or 5 languages to deal with, you can build your
own with the most frequent word lists for each language. I have some
trivial C++ code that does it and can send it to you it you need.
Identified language is choosen on a frequency criterion.

Of course commercial product are available for that (try Xerox & inXight
for instance $$).

The point is how many language you have to identify...

Complement:
Some time ago, Bright Station (UK) had some open source C/C++ code
 for a variety of stemmers for european language (adapted from
the Porter stemmer approach).

I hope this helps,
Elie Naulleau
Semio-Sys


-----Message d'origine-----
De : Strittmatter Stephan (external)
[mailto:Stephan.Strittmatter.ext@kst.siemens.de]
Envoyé : vendredi 23 novembre 2001 14:56
À : 'Lucene Users List'
Objet : Automatically determin Language of document


Hi,

has anyone done anything to autodetect Language of an
HTML-Document which will be indexed by Lucene?

I will use Lucene to index an multilingual Portal
and want to filter the hits by language.

Thanks for any ideas,

Stephan

--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message