uima-user mailing list archives

From "Hannes Carl Meyer" <m...@hcmeyer.com>
Subject Re: Language recognition
Date Mon, 22 Dec 2008 09:37:47 GMT
Hi,
If you're experiencing problems with the results of n-gram-based language
recognition for a specific language, try excluding the profiles of languages
you don't need to recognize!
Regards,
Hannes
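
To illustrate the suggestion above, here is a minimal sketch of rank-based character n-gram language identification with an optional whitelist of languages. It is not NGramJ's API: the function names, the `allowed` parameter, and the tiny training samples are all assumptions made for illustration, and real profiles would be built from much larger corpora.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (1..n_max), most frequent first."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def detect(text, profiles, allowed=None):
    """Pick the closest profile; if `allowed` is given, only those languages compete."""
    candidates = {
        lang: prof for lang, prof in profiles.items()
        if allowed is None or lang in allowed
    }
    if not candidates:
        raise ValueError("no language profiles left after filtering")
    doc = ngram_profile(text)
    return min(candidates, key=lambda lang: (out_of_place(doc, candidates[lang]), lang))


# Hypothetical, tiny training samples; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "fr": "le renard brun saute par-dessus le chien paresseux et le chat",
    "it": "la volpe marrone salta sopra il cane pigro e il gatto",
    "es": "el zorro marrón salta sobre el perro perezoso y el gato",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

text = "il documento tecnico descrive la pipeline"
print(detect(text, PROFILES))                              # all profiles compete
print(detect(text, PROFILES, allowed={"en", "fr", "it"}))  # Spanish profile excluded
```

Restricting `allowed` removes closely related languages (here, Spanish) from the comparison, so they can no longer win against the language you actually expect.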

On Sun, Dec 21, 2008 at 6:55 PM, Tommaso Teofili
<tommaso.teofili@gmail.com> wrote:

> Hi,
> I tried both NgramJ and LanguageWare for automatic language recognition in
> text documents.
> NgramJ does not work very well with all Italian language documents while it
> gets the job done for French and English (tech docs too).
> LanguageWare is a little more difficult to configure, but it works much
> better with many languages (Italian included). Furthermore, it has some
> interesting features, like a "language candidates" collection of possible
> languages for the document, which is useful in case of high uncertainty.
> Bye,
> Tommaso
>
>
> 2008/12/9 Tommaso Teofili <tommaso.teofili@gmail.com>
>
> > Hi,
> > I think I'll give IBM LanguageWare a look because it seems very interesting
> > and I can easily plug it into my existing annotator pipeline.
> > I'll also try NGramJ and see which one has better performance.
> > My goal is to recognize English, Italian and French.
> > Thanks to all, I'll post my results here.
> > Tommaso
> >
> > 2008/12/8 D.J. McCloskey <dj_mccloskey@ie.ibm.com>
> >
> >
> >> Hi Tommaso,
> >>
> >> I saw the mail below on MarkMail and thought you might find what you need
> >> at http://www.alphaworks.ibm.com/tech/lrw.
> >> There's a new, improved version coming soon, but as it stands you will find
> >> an automatic language identification annotator there which is fast and easy
> >> to improve. When sufficient confidence is not reached, it also classifies
> >> the language as complex text or simple text, essentially indicating whether
> >> n-gramming or whitespace tokenization would be appropriate for further
> >> interrogation. Which languages are you interested in?
> >>
> >> The technology is available for evaluation, and if you have further interest
> >> and would like to know more, I'd be happy to help you.
> >>
> >>
> >>  Subject: Language recognition
> >>  From:    Tommaso Teofili (tomm...@gmail.com)
> >>  Date:    12/08/2008 01:22:52 AM
> >>  List:    org.apache.incubator.uima-user
> >>
> >> Hello,
> >>
> >>
> >> I am writing an AE pipeline and I need to recognize the language in which
> >> the starting document is written. My idea is to use the Whitespace Tokenizer
> >> and the HMM Tagger together to analyze the extracted tokens,
> >> calculate the percentage of well-known tokens for each language (against a
> >> dictionary), and then select the language with the highest percentage. Do
> >> you know other (better) language recognition methods? Thanks. Tommaso
> >>
> >>
> >> Regards,
> >> -DJ
> >> -------------------
> >> D.J McCloskey
> >> IBM LanguageWare Architect
> >> Email: dj_mccloskey@ie.ibm.com
> >>
> >> ... our external website:
> >> http://www-306.ibm.com/software/globalization/topics/languageware/index.jsp
> >> ... our Alphaworks: http://www.alphaworks.ibm.com/tech/lrw
> >> ... our Wikipedia: http://en.wikipedia.org/wiki/Languageware
> >>
> >> IBM Ireland Product Distribution Limited registered in Ireland with number
> >> 92815.  Registered office: Oldbrook House, 24-32 Pembroke Road,
> >> Ballsbridge, Dublin 4
> >
> >
> >
>
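
The dictionary-based idea from the original question in this thread (tokenize, compute the share of known tokens per language, pick the highest) could be sketched roughly as follows. This is only an illustration under stated assumptions: a plain regular-expression tokenizer stands in for the UIMA Whitespace Tokenizer and HMM Tagger, and the word lists and function names are hypothetical.

```python
import re

# Hypothetical, tiny word lists; real dictionaries would be far larger.
DICTIONARIES = {
    "en": {"the", "and", "is", "of", "in", "to", "document", "language"},
    "it": {"il", "la", "e", "di", "in", "che", "documento", "lingua"},
    "fr": {"le", "la", "et", "de", "en", "que", "document", "langue"},
}


def tokenize(text):
    """Lowercase the text and return its runs of letters (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage(tokens, dictionary):
    """Fraction of tokens found in the dictionary (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in dictionary) / len(tokens)


def detect_by_dictionary(text, dictionaries=DICTIONARIES):
    """Return (best_language, scores), where scores maps each language to its coverage."""
    tokens = tokenize(text)
    scores = {lang: coverage(tokens, words) for lang, words in dictionaries.items()}
    # Iterate languages in sorted order so ties resolve alphabetically.
    best = max(sorted(scores), key=scores.get)
    return best, scores


best, scores = detect_by_dictionary("il documento e la lingua")
print(best, scores)
```

Returning the full `scores` mapping, not just the winner, makes it possible to inspect the runner-up languages when the top score is not clearly ahead.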
