Message view | « Date » · « Thread » |
---|---|
From | "Tommaso Teofili" <tommaso.teof...@gmail.com> |
Subject | Language recognition |
Date | Mon, 08 Dec 2008 09:23:15 GMT |
Hello, I am writing an AE pipeline and I need to recognize which language the input document is written in. My idea is to use the Whitespace Tokenizer and the HMM Tagger together to analyze the extracted tokens, calculate the percentage of known tokens for each language (against a dictionary), and then select the language with the highest percentage. Do you know of other (better) language recognition methods? Thanks. Tommaso | |
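
The dictionary-scoring idea in the message can be illustrated with a minimal sketch. This is plain Python, not a UIMA annotator: it stands in for the Whitespace Tokenizer with a simple `str.split()` and leaves the HMM Tagger out entirely. The names `DICTIONARIES`, `known_token_ratio`, and `guess_language` are hypothetical, and the tiny word lists are placeholders for real per-language dictionaries.

```python
# Hypothetical toy dictionaries; a real setup would load full word lists.
DICTIONARIES = {
    "en": {"the", "is", "and", "a", "of", "this", "document"},
    "it": {"il", "è", "e", "un", "di", "questo", "documento"},
}


def known_token_ratio(tokens, dictionary):
    """Return the fraction of tokens that appear in the given dictionary."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in dictionary)
    return known / len(tokens)


def guess_language(text, dictionaries=DICTIONARIES):
    """Score each language by its known-token ratio and pick the highest.

    Returns (language, scores); language is None when no token matches
    any dictionary.
    """
    # Whitespace tokenization only; punctuation handling is omitted here.
    tokens = text.lower().split()
    scores = {
        lang: known_token_ratio(tokens, words)
        for lang, words in dictionaries.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, scores
    return best, scores


if __name__ == "__main__":
    print(guess_language("This is a document"))
    print(guess_language("Questo è un documento"))
```

With short inputs or closely related languages the ratios can tie or mislead, so a real pipeline would probably need larger dictionaries or a different technique (character n-gram profiles are a common alternative).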