Message-ID: <8b9ff1e40812220137jc6c312wb61b049e8b3d4e22@mail.gmail.com>
Date: Mon, 22 Dec 2008 10:37:47 +0100
From: "Hannes Carl Meyer"
To: uima-user@incubator.apache.org
Subject: Re: Language recognition

Hi,

if you're experiencing problems with the results of n-gram based language
recognition in a specific language, try to exclude the profiles of languages
you don't need to recognize!

Regards,
Hannes

On Sun, Dec 21, 2008 at 6:55 PM, Tommaso Teofili wrote:

> Hi,
> I tried both NgramJ and LanguageWare for automatic language recognition in
> text documents.
> NgramJ does not work very well with all Italian language documents while it
> gets the job done for French and English (tech docs too).
> LanguageWare is a little more difficult to configure but it works much
> better with many languages (Italian included). Furthermore it has some
> interesting features like a "language candidates" collection of possible
> languages for the document, useful in case of high uncertainty.
> Bye,
> Tommaso
>
>
> 2008/12/9 Tommaso Teofili
>
> > Hi,
> > I think I'll give IBM LanguageWare a look because it seems very interesting
> > and I can easily plug it into my existing annotator pipeline.
> > I'll also try NGramJ and see which one has better performance.
> > My goal is to recognize English, Italian and French.
> > Thanks to all, I'll let you know my results here.
> > Tommaso
> >
> > 2008/12/8 D.J. McCloskey
> >
> >> Hi Tommaso,
> >>
> >> I saw the mail below on MarkMail and thought you might find what you need
> >> at http://www.alphaworks.ibm.com/tech/lrw.
> >> There's a new improved version coming soon but as it stands you will find
> >> an automatic language identification annotator there which is fast and
> >> easy to improve. It also classifies languages when a sufficient confidence
> >> is not reached into complex text or simple text, essentially indicating
> >> whether ngramming or whitespace tokenization would be appropriate for
> >> further interrogation. Which languages are you interested in?
> >>
> >> The technology is available for evaluation and if you have further
> >> interest and would like to know more I'd be happy to help you.
> >>
> >>
> >> Subject: Language recognition
> >> From: Tommaso Teofili (tomm...@gmail.com)
> >> Date: 12/08/2008 01:22:52 AM
> >> List: org.apache.incubator.uima-user
> >>
> >> Hello,
> >>
> >> I am writing an AE pipeline and I need to recognize in which language the
> >> starting document is written. My idea is to use the Whitespace Tokenizer
> >> and the HMM Tagger together in order to analyze the extracted tokens,
> >> calculate the percentage of well known tokens for each language (against a
> >> dictionary) and then select the language with the highest percentage... Do
> >> you know other (better) language recognition methods? Thanks.
> >> Tommaso
> >>
> >>
> >> Regards,
> >> -DJ
> >> -------------------
> >> D.J McCloskey
> >> IBM LanguageWare Architect
> >> Email: dj_mccloskey@ie.ibm.com
> >>
> >> ... our external website:
> >> http://www-306.ibm.com/software/globalization/topics/languageware/index.jsp
> >> ... our Alphaworks: http://www.alphaworks.ibm.com/tech/lrw
> >> ... our Wikipedia: http://en.wikipedia.org/wiki/Languageware
> >>
> >> IBM Ireland Product Distribution Limited registered in Ireland with number
> >> 92815. Registered office: Oldbrook House, 24-32 Pembroke Road,
> >> Ballsbridge, Dublin 4
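
Illustration for the advice at the top of this thread (excluding the profiles of languages you don't need): the following is only a minimal Python sketch of rank-order character n-gram matching using an "out-of-place" distance. It is not the actual implementation or API of NGramJ or LanguageWare. The sample sentences, the function names and the `allowed` parameter are all assumptions made up for this example, and real profiles would have to be built from far more text.

```python
from collections import Counter

# Toy training sentences (assumption: made up for this sketch; real profiles
# need much more text per language).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs through the forest",
    "it": "la volpe veloce salta sopra il cane pigro e corre attraverso la foresta",
    "fr": "le renard rapide saute par-dessus le chien paresseux et court dans la forêt",
}


def ngram_profile(text, n_max=3, top_k=300):
    """Return the text's character n-grams (length 1..n_max), most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Break frequency ties alphabetically so that ranks are deterministic.
    return sorted(counts, key=lambda g: (-counts[g], g))[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else penalty
        for rank, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles, allowed=None):
    """Return the closest language, optionally considering only those in `allowed`."""
    candidates = [lang for lang in profiles if allowed is None or lang in allowed]
    if not candidates:
        raise ValueError("no candidate language profiles to compare against")
    doc_profile = ngram_profile(text)
    # The language code is part of the key so that distance ties resolve deterministically.
    return min(candidates, key=lambda lang: (out_of_place(doc_profile, profiles[lang]), lang))


PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    text = "il documento tecnico descrive la configurazione"
    print(guess_language(text, PROFILES))                        # all profiles
    print(guess_language(text, PROFILES, allowed={"it", "fr"}))  # restricted set
```

Passing `allowed` removes the excluded languages from the comparison altogether, so a closely related language that is not needed can no longer win over the intended one. With profiles this small the guesses are not reliable; the sketch is only meant to show where the restriction takes effect.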