Subject: RE: Language recognition
From: "D.J. McCloskey"
To: uima-user@incubator.apache.org
Cc: Marie Wallace
Date: Mon, 8 Dec 2008 19:24:51 +0000

Hi Tommaso,

I saw the mail below on MarkMail and thought you might find what you
need at http://www.alphaworks.ibm.com/tech/lrw.

A new, improved version is coming soon, but as it stands you will find
an automatic language identification annotator there that is fast and
easy to improve. When it cannot reach sufficient confidence in a
language, it also classifies the text as complex text or simple text,
essentially indicating whether n-gramming or whitespace tokenization
would be appropriate for further interrogation.

Which languages are you interested in?

The technology is available for evaluation, and if you would like to
know more I'd be happy to help you.

Subject: Language recognition
From: Tommaso Teofili (tomm...@gmail.com)
Date: 12/08/2008 01:22:52 AM
List: org.apache.incubator.uima-user

Hello,
I am writing an AE pipeline and I need to recognize the language in
which the starting document is written.
My idea is to use the Whitespace Tokenizer and the HMM Tagger together
in order to analyze the extracted tokens, calculate the percentage of
well-known tokens for each language (against a dictionary), and then
select the language with the highest percentage...
Do you know of other (better) language recognition methods?
Thanks.
Tommaso

Regards,
-DJ
-------------------
D.J McCloskey
IBM LanguageWare Architect
Email: dj_mccloskey@ie.ibm.com
... our external website: http://www-306.ibm.com/software/globalization/topics/languageware/index.jsp
... our Alphaworks: http://www.alphaworks.ibm.com/tech/lrw
... our Wikipedia: http://en.wikipedia.org/wiki/Languageware

IBM Ireland Product Distribution Limited registered in Ireland with
number 92815. Registered office: Oldbrook House, 24-32 Pembroke Road,
Ballsbridge, Dublin 4
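
For readers following the thread, the dictionary-coverage idea in the
quoted message can be sketched roughly as below. This is only an
illustration under stated assumptions: it is plain Python, it does not
use the UIMA Whitespace Tokenizer or HMM Tagger, and it is not the
LanguageWare annotator. The word lists and the names DICTIONARIES,
tokenize, coverage, and guess_language are hypothetical, invented for
this example.

```python
import string

# Hypothetical, tiny word lists (assumption); a real dictionary would be far larger.
DICTIONARIES = {
    "en": {"the", "and", "is", "in", "of", "to", "a", "language"},
    "it": {"il", "la", "e", "di", "che", "in", "un", "lingua"},
    "de": {"der", "die", "und", "ist", "in", "das", "ein", "sprache"},
}


def tokenize(text):
    """Split on whitespace, lowercase, and strip surrounding punctuation."""
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def coverage(tokens, dictionary):
    """Fraction of tokens found in the dictionary (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(tok in dictionary for tok in tokens) / len(tokens)


def guess_language(text, dictionaries=DICTIONARIES):
    """Return (best_language, scores); best_language is None if nothing matches."""
    tokens = tokenize(text)
    scores = {lang: coverage(tokens, words) for lang, words in dictionaries.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, scores
    return best, scores


if __name__ == "__main__":
    language, scores = guess_language("The language is in the document")
    print(language, scores)
```

Because the sketch splits on whitespace, it can only be expected to work
for languages that separate words with spaces. That is the distinction
the reply above draws between text suited to whitespace tokenization and
text where n-gramming would be more appropriate.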