Message-ID: <45637F4C.4030808@alias-i.com>
Date: Tue, 21 Nov 2006 17:35:56 -0500
From: Bob Carpenter
To: java-user@lucene.apache.org
Subject: Re: Analyzers and multiple languages (language detection)
In-Reply-To: <452F4381.5000106@teamware.com>

Antony Bowesman wrote:
> Hello,
>
> I'm new to
> Lucene and wanted some advice on analyzers, stemmers and
> language analysis. I've got LIA, so have read its chapters.
>
> I am writing a framework that needs to be able to index documents from a
> range of languages where just the character set of the document is
> known. Has anyone looked at, or is anyone using, language analysis to
> determine the language of a document in ISO-8859-1?

Language ID is pretty easy. The best way to do it
wholly within Lucene would be with a separate index
containing one document per language, with an analyzer
that returned weighted character n-grams. You can read
about our analyzer for doing that in LIA. This is what some
of the packages, such as Gertjan van Noord's, do.

If you need very high accuracy, you could also use our
language ID, which is based on a probabilistic classifier.
You can check out our tutorial at:

http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html

Accuracy depends on the pair of languages (some are more
confusable than others), as well as the length of the input
(it's very hard with only one or two words, especially
if it's a name).

- Bob Carpenter
Alias-i
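
For anyone who wants to see the shape of the character n-gram idea
above, here is a rough sketch in plain Java. It is not Lucene code and
not the LingPipe classifier: the class name NGramLangId, the use of
trigrams, the raw-count weighting and the cosine score are all
assumptions made for illustration. It keeps one n-gram profile per
language (standing in for the "one document per language" index) and
picks the profile closest to the input.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy character n-gram language guesser (hypothetical; not a Lucene or LingPipe API). */
final class NGramLangId {
    private final int n;
    // One n-gram count profile per language, playing the role of one "document" per language.
    private final Map<String, Map<String, Double>> profiles = new HashMap<>();

    NGramLangId(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        this.n = n;
    }

    /** Counts overlapping character n-grams of the lower-cased, space-padded text. */
    Map<String, Double> ngrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        Map<String, Double> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1.0, Double::sum);
        }
        return counts;
    }

    /** Adds the n-grams of sampleText to the profile for the given language. */
    void train(String language, String sampleText) {
        Map<String, Double> profile = profiles.computeIfAbsent(language, k -> new HashMap<>());
        ngrams(sampleText).forEach((gram, count) -> profile.merge(gram, count, Double::sum));
    }

    /** Cosine similarity between two sparse count vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the closest language, or null if no n-gram overlaps any profile. */
    String classify(String text) {
        Map<String, Double> query = ngrams(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = cosine(query, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLangId id = new NGramLangId(3);
        // Tiny made-up training samples; real profiles need far more text per language.
        id.train("en", "the quick brown fox jumps over the lazy dog and this is what the weather looks like today");
        id.train("de", "der schnelle braune fuchs springt und das ist das wetter von heute in der stadt");
        System.out.println(id.classify("where is the station"));
        System.out.println(id.classify("wo ist der bahnhof"));
    }
}
```

With training text this small the guesses are fragile, which matches
the point about short inputs: one or two words give very few n-grams
to compare against each profile.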