uima-user mailing list archives

From "D.J. McCloskey" <dj_mcclos...@ie.ibm.com>
Subject RE: Language recognition
Date Mon, 08 Dec 2008 19:24:51 GMT

Hi Tommaso,

I saw the mail below on MarkMail and thought you might find what you need
at http://www.alphaworks.ibm.com/tech/lrw.
There's a new, improved version coming soon, but as it stands you will find
an automatic language identification annotator there which is fast and easy
to improve. When sufficient confidence in a specific language is not
reached, it also classifies the text as either complex text or simple text,
essentially indicating whether n-gramming or whitespace tokenization,
respectively, would be appropriate for further interrogation. Which
languages are you interested in?
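
To illustrate the complex/simple text distinction, here is a minimal Python sketch. It is not how the LanguageWare annotator works internally; it is only a guess at the general idea, assuming that "complex text" means text written mostly in scripts that do not separate words with spaces. The function name `suggest_tokenization`, the code point ranges, and the 0.5 threshold are all hypothetical choices made for this example.

```python
# Assumed (hypothetical) code point ranges for scripts that usually do not
# put whitespace between words. This list is illustrative, not exhaustive.
UNSPACED_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x0E00, 0x0E7F),  # Thai
]


def is_unspaced_script(ch: str) -> bool:
    """Return True if the character falls in one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UNSPACED_RANGES)


def suggest_tokenization(text: str, threshold: float = 0.5) -> str:
    """Suggest 'ngram' for mostly unspaced-script text, else 'whitespace'.

    Only alphabetic characters are counted, so digits, punctuation and
    spaces do not influence the ratio.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # Nothing to judge; fall back to the simpler strategy.
        return "whitespace"
    unspaced = sum(1 for ch in letters if is_unspaced_script(ch))
    return "ngram" if unspaced / len(letters) >= threshold else "whitespace"


if __name__ == "__main__":
    print(suggest_tokenization("The quick brown fox"))
    print(suggest_tokenization("これは日本語の文章です"))
```

A check like this would only run as a fallback, after language identification itself has failed to reach a confident answer.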

The technology is available for evaluation and if you have further interest
and would like to know more I'd be happy to help you.

                                           
  Subject: Language recognition
  From:    Tommaso Teofili (tomm...@gmail.com)
  Date:    12/08/2008 01:22:52 AM
  List:    org.apache.incubator.uima-user

Hello,


I am writing an AE pipeline and I need to recognize the language in which
the input document is written. My idea is to use the Whitespace Tokenizer
and the HMM Tagger together to analyze the extracted tokens, calculate the
percentage of known tokens for each language (against a dictionary), and
then select the language with the highest percentage. Do you know other
(better) language recognition methods? Thanks. Tommaso
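
The dictionary-coverage idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not use the UIMA Whitespace Tokenizer or HMM Tagger, the tiny word lists in `DICTIONARIES` are made up for the example, and the function names `tokenize`, `coverage` and `guess_language` are hypothetical.

```python
import string
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy dictionaries; a real pipeline would load full word lists.
DICTIONARIES: Dict[str, Set[str]] = {
    "en": {"the", "is", "and", "of", "a", "document", "this"},
    "it": {"il", "la", "è", "e", "di", "un", "documento", "questo"},
    "de": {"der", "die", "das", "ist", "und", "ein", "dokument", "dieses"},
}


def tokenize(text: str) -> List[str]:
    """Split on whitespace, lowercase, and strip surrounding punctuation."""
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in tokens if t]


def coverage(tokens: List[str], dictionary: Set[str]) -> float:
    """Fraction of tokens that appear in the given dictionary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in dictionary) / len(tokens)


def guess_language(
    text: str, dictionaries: Dict[str, Set[str]] = DICTIONARIES
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return the language with the highest coverage, plus all scores.

    Returns None as the language when no token matches any dictionary.
    """
    tokens = tokenize(text)
    scores = {lang: coverage(tokens, words) for lang, words in dictionaries.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, scores
    return best, scores


if __name__ == "__main__":
    lang, scores = guess_language("Questo è il documento.")
    print(lang, scores)
```

One weakness of this approach is that short texts and words shared between languages give unreliable percentages, which is why a confidence threshold is usually applied before accepting the top-scoring language.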


Regards,
-DJ
-------------------
D.J McCloskey
IBM LanguageWare Architect
Email: dj_mccloskey@ie.ibm.com

... our external website:
http://www-306.ibm.com/software/globalization/topics/languageware/index.jsp
... our Alphaworks: http://www.alphaworks.ibm.com/tech/lrw
... our Wikipedia: http://en.wikipedia.org/wiki/Languageware

IBM Ireland Product Distribution Limited registered in Ireland with number
92815.  Registered office: Oldbrook House, 24-32 Pembroke Road,
Ballsbridge, Dublin 4