Subject: RE: Language recognition
Date: Mon, 8 Dec 2008 10:52:44 +0100
From: "Torsten Zesch"

Hi Tommaso,

you could use TextCat
http://odur.let.rug.nl/~vannoord/TextCat/
or one of its competitors:
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html

-Torsten

> -----Original Message-----
> From: Tommaso Teofili [mailto:tommaso.teofili@gmail.com]
> Sent: Monday, December 08, 2008 10:23 AM
> To: uima-user@incubator.apache.org
> Subject: Language recognition
>
> Hello,
> I am writing an AE pipeline and I need to recognize which language the
> starting document is written in.
> My idea is to use the Whitespace Tokenizer and the HMM Tagger together to
> analyze the extracted tokens, calculate the percentage of known tokens
> for each language (against a dictionary), and then select the language
> with the highest percentage.
> Do you know other (better) language recognition methods?
> Thanks.
> Tommaso
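
For illustration only, the dictionary-coverage idea described in the quoted message could be sketched roughly as below. This is a standalone Python sketch, not UIMA code and not how TextCat works: the tiny word lists, the regex tokenizer standing in for the Whitespace Tokenizer, and the names DICTIONARIES, tokenize, coverage_scores and guess_language are all hypothetical. The HMM Tagger step is left out because the coverage calculation does not need it.

```python
import re

# Hypothetical toy word lists; a real setup would load full dictionaries.
DICTIONARIES = {
    "en": {"the", "is", "and", "in", "document", "this", "written"},
    "it": {"il", "è", "e", "in", "documento", "questo", "scritto"},
    "de": {"das", "ist", "und", "in", "dokument", "dieses", "geschrieben"},
}


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    # [^\W\d_] matches any Unicode letter (word character minus digits and "_").
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage_scores(text, dictionaries=DICTIONARIES):
    """Return, per language, the fraction of tokens found in its dictionary."""
    tokens = tokenize(text)
    if not tokens:
        return {lang: 0.0 for lang in dictionaries}
    return {
        lang: sum(token in words for token in tokens) / len(tokens)
        for lang, words in dictionaries.items()
    }


def guess_language(text, dictionaries=DICTIONARIES):
    """Pick the language with the highest coverage, or None if nothing matched."""
    scores = coverage_scores(text, dictionaries)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    sample = "This document is written in the English language"
    print(coverage_scores(sample))
    print(guess_language(sample))
```

One caveat with this kind of scoring: it depends heavily on dictionary size and on the length of the input, so short documents with few dictionary words may tie or produce no match at all.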