uima-user mailing list archives

From "Hannes Carl Meyer" <hannesc...@googlemail.com>
Subject Re: Language recognition
Date Mon, 08 Dec 2008 09:53:14 GMT
Hi Tommaso,

One common method for language recognition is based on character n-grams:
the most frequent n-grams of a document are compared with per-language profiles.
There are also some Java implementations out there, for example NGramJ:
http://ngramj.sourceforge.net/

Nutch (the crawler from the Lucene project) also uses the n-gram approach. You
can find some information about it here http://wiki.apache.org/nutch/LanguageIdentifier and
here http://wiki.apache.org/nutch/LanguageIdentifierPlugin
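
To make the n-gram idea concrete, here is a minimal sketch, assuming character
trigrams and a rank-order ("out-of-place") comparison between the document and
each language profile. It is not the code of NGramJ or of the Nutch plugin; the
class NGramLanguageGuesser, its methods, and the constants N and PROFILE_SIZE
are hypothetical names chosen for illustration, and real profiles would be
trained on much larger samples.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of NGramJ, Nutch, or UIMA.
final class NGramLanguageGuesser {
    private static final int N = 3;              // assumed n-gram length
    private static final int PROFILE_SIZE = 300; // assumed profile cut-off, also the miss penalty

    // language code -> (n-gram -> rank); TreeMap keeps the iteration order deterministic
    private final Map<String, Map<String, Integer>> profiles = new TreeMap<>();

    // Builds a profile: each n-gram mapped to its frequency rank (0 = most frequent).
    static Map<String, Integer> rankedProfile(String text) {
        // Lowercase, collapse every run of non-letters into one space, pad both ends.
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        String padded = " " + cleaned + " ";

        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + N), 1, Integer::sum);
        }

        // Sort by descending count; ties are broken alphabetically for stable ranks.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        Map<String, Integer> ranks = new HashMap<>();
        for (int rank = 0; rank < entries.size() && rank < PROFILE_SIZE; rank++) {
            ranks.put(entries.get(rank).getKey(), rank);
        }
        return ranks;
    }

    // Sums the rank differences; an n-gram unknown to the language costs PROFILE_SIZE.
    static int outOfPlace(Map<String, Integer> doc, Map<String, Integer> lang) {
        int distance = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer langRank = lang.get(e.getKey());
            distance += (langRank == null) ? PROFILE_SIZE : Math.abs(langRank - e.getValue());
        }
        return distance;
    }

    // Registers a language profile built from a sample text.
    void train(String language, String sampleText) {
        profiles.put(language, rankedProfile(sampleText));
    }

    // Returns the trained language whose profile is closest to the text.
    String guess(String text) {
        if (profiles.isEmpty()) {
            throw new IllegalStateException("no language profiles trained");
        }
        Map<String, Integer> doc = rankedProfile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            int distance = outOfPlace(doc, e.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

You would call train("en", ...) and train("de", ...) once per language with a
sample text and then guess(...) on the document text. Unlike a dictionary
lookup, this needs no tokenizer and no word list, only a sample per language.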

I wouldn't suggest reinventing the wheel unless it is a bigger, faster one!

Regards

Hannes
---
http://mimblog.de

On Mon, Dec 8, 2008 at 10:23 AM, Tommaso Teofili
<tommaso.teofili@gmail.com> wrote:

> Hello,
> I am writing an AE pipeline and I need to recognize the language in which the
> starting document is written.
> My idea is to use the Whitespace Tokenizer and the HMM Tagger together in
> order to analyze the extracted tokens, calculate the percentage of known
> tokens for each language (against a dictionary) and then select the
> language with the highest percentage...
> Do you know other (better) language recognition methods?
> Thanks.
> Tommaso
>
