lucene-java-user mailing list archives

From "Elie Naulleau" <semio...@wanadoo.fr>
Subject RE: Automatically determin Language of document
Date Wed, 28 Nov 2001 09:20:51 GMT
Hi Stephan,

You'll find the example code attached to this message. The archive
also contains a model.dat file for French and English.
Remember that this is a simplistic approach to language guessing.
It will work for distinguishing between French, English, Spanish, etc.,
but is likely to fail between Finnish, Swedish, Norwegian, etc.
Porting to Java should be straightforward.
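
For readers without the attachment, here is a minimal Java sketch of the
frequent-word idea described in the quoted message below. It is not the
attached C++ code, and it does not read model.dat (its format is not given
here). The class name LanguageGuesser, the methods sampleModels and guess,
and the tiny word lists are all assumptions made for illustration. The
"frequency criterion" is interpreted as: pick the language whose word list
matches the most tokens in the text.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class LanguageGuesser {

    // Hypothetical, hand-picked lists of very common words per language.
    // A real model would hold far more words, derived from a corpus.
    static Map<String, Set<String>> sampleModels() {
        Map<String, Set<String>> models = new LinkedHashMap<>();
        models.put("en", Set.of("the", "of", "and", "to", "in", "is", "that", "it", "for", "with"));
        models.put("fr", Set.of("le", "la", "les", "de", "des", "et", "un", "une", "est", "que"));
        models.put("de", Set.of("der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu", "mit"));
        return models;
    }

    // Returns the language whose word list matches the most tokens,
    // or null when no token matches any list. Ties go to the language
    // that was added to the map first.
    static String guess(String text, Map<String, Set<String>> models) {
        // Lowercase, then split on anything that is not a Unicode letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> model : models.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (model.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = model.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> models = sampleModels();
        System.out.println(guess("The name of the document is in the index", models));
        System.out.println(guess("Le nom du document est dans les archives", models));
        System.out.println(guess("Der Name ist nicht mit dem Index verbunden", models));
    }
}
```

Counting raw hits keeps the sketch short. With word lists of unequal size,
or very short input texts, normalizing the hit count would probably be
needed before comparing languages.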

Elie

-----Original Message-----
From: Strittmatter Stephan (external)
[mailto:Stephan.Strittmatter.ext@kst.siemens.de]
Sent: Wednesday, 28 November 2001 08:40
To: 'Elie Naulleau'; 'Lucene Users List'
Subject: RE: Automatically determin Language of document


Hi Elie,

> You could try Doug Beeferman's variable-length character n-gram approach
> to identify a language among 13 European ones.
> http://www.dougb.com/ident.html

> If you just have 4 or 5 languages to deal with, you can build your
> own with the most frequent word lists for each language. I have some
> trivial C++ code that does it and can send it to you if you need it.
> The identified language is chosen on a frequency criterion.
>

At the moment I have only two languages (en, de), but this could increase,
though I don't think to more than your 4 or 5.
It would be great if you could send me your example code.
I will probably try to port it to Java.

Thanks in advance,

Stephan Strittmatter

