uima-user mailing list archives

From "Tommaso Teofili" <tommaso.teof...@gmail.com>
Subject Language recognition
Date Mon, 08 Dec 2008 09:23:15 GMT
I am writing an AE pipeline and I need to recognize the language in which
the input document is written.
My idea is to use the Whitespace Tokenizer and the HMM Tagger together to
analyze the extracted tokens, calculate the percentage of known tokens for
each language (against a dictionary), and then select the language with the
highest percentage.
Do you know of other (better) language recognition methods?
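
The dictionary-coverage idea described above can be sketched roughly as
follows. This is plain Python rather than UIMA code, and it only illustrates
the scoring step: the tiny word lists, the function names, and the
punctuation handling are all assumptions made for the example, and the HMM
Tagger step is left out.

```python
# Hypothetical toy dictionaries; a real setup would load full word lists.
DICTIONARIES = {
    "en": {"the", "is", "and", "in", "document", "written", "this"},
    "it": {"il", "è", "e", "in", "documento", "scritto", "questo"},
    "de": {"der", "ist", "und", "in", "dokument", "geschrieben", "dieses"},
}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(".,;:!?\"'()[]") for t in text.lower().split())
    return [t for t in tokens if t]


def coverage(tokens, dictionary):
    """Return the fraction of tokens that appear in the given dictionary."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in dictionary)
    return known / len(tokens)


def guess_language(text, dictionaries=DICTIONARIES):
    """Score every language and return (best_language, scores).

    best_language is None when no token matches any dictionary.
    """
    tokens = tokenize(text)
    scores = {lang: coverage(tokens, words)
              for lang, words in dictionaries.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, scores
    return best, scores


if __name__ == "__main__":
    lang, scores = guess_language("Questo documento è scritto in italiano.")
    print(lang, scores)
```

With dictionaries this small, a word shared between languages (such as "in")
counts for all of them, so short or mixed-language texts can produce close
scores. The result is probably best treated as a guess, not a certainty.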
