uima-user mailing list archives

From "Tommaso Teofili" <tommaso.teof...@gmail.com>
Subject Re: Language recognition
Date Tue, 09 Dec 2008 09:32:53 GMT
Hi,
I think I'll give IBM LanguageWare a look because it seems very interesting
and I can easily plug it into my existing annotator pipeline.
I'll also try NGramJ and see which one performs better.
My goal is to recognize English, Italian and French.
Thanks to all; I'll report my results here.
Tommaso

2008/12/8 D.J. McCloskey <dj_mccloskey@ie.ibm.com>

>
> Hi Tommaso,
>
> I saw the mail below on MarkMail and thought you might find what you need
> at http://www.alphaworks.ibm.com/tech/lrw.
> There's a new, improved version coming soon, but as it stands you will find
> an automatic language identification annotator there which is fast and easy
> to improve. When it cannot reach sufficient confidence in a language, it
> classifies the text as complex or simple, essentially indicating whether
> n-gramming or whitespace tokenization would be appropriate for further
> interrogation. Which languages are you interested in?
>
> The technology is available for evaluation and if you have further interest
> and would like to know more I'd be happy to help you.
>
>
>  Subject: Language recognition
>  From:    Tommaso Teofili (tomm...@gmail.com)
>  Date:    12/08/2008 01:22:52 AM
>  List:    org.apache.incubator.uima-user
>
> Hello,
>
>
> I am writing an AE pipeline and I need to recognize which language the
> input document is written in. My idea is to use the Whitespace Tokenizer
> and the HMM Tagger together to analyze the extracted tokens, calculate the
> percentage of known tokens for each language (against a dictionary) and
> then select the language with the highest percentage. Do you know other
> (better) language recognition methods? Thanks. Tommaso
>
>
> Regards,
> -DJ
> -------------------
> D.J McCloskey
> IBM LanguageWare Architect
> Email: dj_mccloskey@ie.ibm.com
>
> ... our external website:
> http://www-306.ibm.com/software/globalization/topics/languageware/index.jsp
> ... our Alphaworks: http://www.alphaworks.ibm.com/tech/lrw
> ... our Wikipedia: http://en.wikipedia.org/wiki/Languageware
>
> IBM Ireland Product Distribution Limited registered in Ireland with number
> 92815.  Registered office: Oldbrook House, 24-32 Pembroke Road,
> Ballsbridge, Dublin 4
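
For anyone reading this thread later, the dictionary-coverage idea from the
original question (whitespace tokens, percentage of known tokens per language,
pick the highest) might look roughly like the sketch below. It is plain Python
rather than a UIMA annotator, the tiny word lists are made-up placeholders for
real dictionaries, and the names DICTIONARIES, coverage and guess_language are
hypothetical. It does not involve the HMM Tagger, LanguageWare or NGramJ.

```python
# Illustrative word lists only (an assumption); a real system would load
# full per-language dictionaries instead of these few function words.
DICTIONARIES = {
    "en": {"the", "and", "is", "in", "of", "to", "this", "a"},
    "it": {"il", "la", "e", "di", "che", "è", "in", "un", "questo"},
    "fr": {"le", "la", "et", "de", "est", "un", "dans", "ce", "les"},
}


def coverage(tokens, dictionary):
    """Fraction of tokens that appear in the given dictionary."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in dictionary)
    return known / len(tokens)


def guess_language(text, dictionaries=DICTIONARIES):
    """Return (best_language, scores); best_language is None if nothing matches."""
    # Whitespace tokenization, then strip surrounding punctuation and lowercase.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]

    # Score every language by the share of tokens found in its dictionary.
    scores = {lang: coverage(tokens, words) for lang, words in dictionaries.items()}

    # Pick the language with the highest coverage; ties go to the first one seen.
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, scores
    return best, scores


if __name__ == "__main__":
    for sample in ("This is the house of the king",
                   "Questo è il libro di un amico",
                   "Le chat est dans la maison"):
        print(sample, "->", guess_language(sample)[0])
```

With short inputs or closely related languages the coverage scores can be close
or tied, so a real pipeline would probably want a minimum-confidence threshold
before trusting the result.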
