From: Andrzej Bialecki
Date: Mon, 22 Aug 2005 09:51:43 +0200
To: java-user@lucene.apache.org
Subject: Re: NGram Language Categorization Source
Message-ID: <4309840F.3040008@getopt.org>
In-Reply-To: <30c6373b050821134560a3bb11@mail.gmail.com>

Kevin Burton wrote:
>> A lot depends on the reference profiles (which in turn depend on the
>> quality of your training corpus - in this case, your corpus is not the
>> best choice, because each text contains a lot of foreign words).
>
> I realize that my corpus isn't the best. That's one of the reasons
> I've open-sourced it. The main improvement in ngramcat (my code) is
> that if the result isn't obvious we throw an Exception, so
> theoretically we won't see any false positives unless the language
> categorization is WAY off.

That's also how other implementations do it: you need to set an
arbitrary threshold, and if all the profiles score below that threshold
then an "unknown" value is returned (or null, or an Exception).

>> It was also found that the way you create ngram profiles (e.g. with or
>> without surrounding spaces, single length or mixed length) affects the
>> LI performance.
>
> LI???

LI = Language Identification. Sorry for the confusion.

> I haven't benchmarked it but I'd be interested in any suggestions you have.
>
>> For documents with mixed languages it was also found that
>> methods which combine ngrams with stopwords work better.
>
> Hm.. interesting.. where? URL I can read?

Someone mentioned the Linguini paper, where they found that using
"short words" features gives performance about as good as using ngrams.
See also the following papers:

http://www.xrce.xerox.com/Publications/Attachments/1995-012/Gref---Comparing-two-language-identification-schemes.pdf
http://citeseer.ist.psu.edu/40861.html
http://www.xs4all.nl/~ajwp/langident.pdf

In general, using stop words works only for texts above a certain
minimum length (greater than with n-gram methods), and then gives
nearly 100% accuracy.

>> So, there is still a lot to do in this area, if you come up with some
>> unique way of improving LI performance...
>
> Maybe I'm being dense but what is LI performance?

Language Identification performance - in the sense that a given
identifier "performs" better if it can correctly identify more
languages, using shorter input text.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
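
The threshold approach discussed in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ngramcat code or any other implementation mentioned in the thread: the function names (ngram_profile, cosine, identify), the choice of cosine similarity as the score, the toy training texts, and the 0.3 threshold are all made up for the example. The pad, n_min and n_max parameters correspond to the "with or without surrounding spaces, single length or mixed length" choices for building profiles.

```python
from collections import Counter
from math import sqrt

# Toy training texts: illustrative assumptions, far too small for real use.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sits on the mat",
    "de": "der schnelle braune fuchs springt und die katze sitzt auf der matte",
}


def ngram_profile(text, n_min=1, n_max=3, pad=True):
    """Count character n-grams of length n_min..n_max, word by word."""
    counts = Counter()
    for word in text.lower().split():
        # pad=True wraps each word in spaces, so word-boundary n-grams
        # such as " t" and "e " become features of their own.
        token = f" {word} " if pad else word
        for n in range(n_min, n_max + 1):
            for i in range(len(token) - n + 1):
                gram = token[i:i + n]
                if gram.strip():  # skip n-grams made only of padding
                    counts[gram] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def identify(text, profiles, threshold=0.3):
    """Return the best-scoring language, or None ("unknown") below the threshold."""
    doc = ngram_profile(text)
    if not doc:
        return None
    best_lang, best_score = max(
        ((lang, cosine(doc, profile)) for lang, profile in profiles.items()),
        key=lambda pair: pair[1],
    )
    # The threshold is arbitrary and has to be tuned on real data.
    return best_lang if best_score >= threshold else None


PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}
```

Calling identify(some_text, PROFILES) returns a language code, or None when no profile scores at least the threshold. Raising an exception instead of returning None, as ngramcat is described as doing, would be a one-line change in identify.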