uima-user mailing list archives

From: "Torsten Zesch" <ze...@tk.informatik.tu-darmstadt.de>
Subject: RE: Language recognition
Date: Mon, 08 Dec 2008 09:52:44 GMT

Hi Tommaso,

You could use TextCat:
http://odur.let.rug.nl/~vannoord/TextCat/

or one of its competitors:
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html

-Torsten 
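
For context: TextCat follows the character n-gram approach of Cavnar and
Trenkle, where a document's ranked n-gram profile is compared against one
profile per language using an "out-of-place" distance. Below is a minimal
Python sketch of that idea, not TextCat's actual code. The function names,
the n-gram length limit (3), the profile size (300), and the two tiny
training sentences are all assumptions chosen for illustration; real
profiles are built from much larger samples.

```python
from collections import Counter

TOP_K = 300  # assumed profile size, for illustration only


def ngram_profile(text, max_n=3, top_k=TOP_K):
    """Return the most frequent character n-grams (n = 1..max_n), ranked."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile, max_penalty=TOP_K):
    """Sum of rank differences; n-grams missing from lang_profile cost max_penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    return sum(
        abs(r - rank[g]) if g in rank else max_penalty
        for r, g in enumerate(doc_profile)
    )


def classify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Hypothetical training samples, far too small for real use
EN_SAMPLE = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
DE_SAMPLE = "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte"

PROFILES = {"en": ngram_profile(EN_SAMPLE), "de": ngram_profile(DE_SAMPLE)}
```

Calling classify(some_text, PROFILES) returns the key of the closest
profile. Because it works on character n-grams, this kind of method needs
no tokenizer, tagger, or dictionary.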

> -----Original Message-----
> From: Tommaso Teofili [mailto:tommaso.teofili@gmail.com] 
> Sent: Monday, December 08, 2008 10:23 AM
> To: uima-user@incubator.apache.org
> Subject: Language recognition
> 
> Hello,
> I am writing an AE pipeline and I need to recognize which language the
> input document is written in.
> My idea is to use the Whitespace Tokenizer and the HMM Tagger together to
> analyze the extracted tokens, calculate the percentage of known tokens
> for each language (against a dictionary), and then select the language
> with the highest percentage.
> Do you know of other (better) language recognition methods?
> Thanks.
> Tommaso
> 
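
The dictionary-coverage idea from the quoted message can be sketched
roughly as follows. This is plain Python rather than a UIMA annotator, and
it skips the tagging step entirely. The word lists, the function names, and
the regex-based tokenizer are assumptions made up for illustration; a real
setup would plug in full dictionaries and the pipeline's own tokens.

```python
import re

# Hypothetical, tiny word lists; real dictionaries would be far larger
DICTIONARIES = {
    "en": {"the", "is", "and", "in", "document", "written", "this"},
    "de": {"der", "die", "das", "ist", "und", "in", "dokument", "dieses"},
    "it": {"il", "la", "è", "e", "in", "documento", "questo", "scritto"},
}


def tokenize(text):
    """Lowercased runs of letters; a rough stand-in for a whitespace tokenizer."""
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage(text, dictionaries=DICTIONARIES):
    """Fraction of tokens found in each language's dictionary."""
    tokens = tokenize(text)
    if not tokens:
        return {lang: 0.0 for lang in dictionaries}
    return {
        lang: sum(t in words for t in tokens) / len(tokens)
        for lang, words in dictionaries.items()
    }


def guess_language(text, dictionaries=DICTIONARIES):
    """Language with the highest coverage, or None if no token is known."""
    scores = coverage(text, dictionaries)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

One weakness of this approach is that short words shared between languages
(such as "in" above) count for all of them, so short inputs can be
ambiguous.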
