Message-ID: <713131385.1261062320493.JavaMail.jira@brutus>
Date: Thu, 17 Dec 2009 15:05:20 +0000 (UTC)
From: "Dennis Kubes (JIRA)"
To: nutch-dev@lucene.apache.org
Reply-To: nutch-dev@lucene.apache.org
Subject: [jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
In-Reply-To: <456363254.1227711284567.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-666:
-------------------------------

    Attachment: NUTCH-666-2-20091217-nf.patch

Here is the patch as I last used it, almost a year ago now. I am not sure whether it still works with the current codebase. It uses a hacky version of textcat to create fingerprint files from content in known languages. Those fingerprints form a dictionary, which is configured through the textcat.conf file in the conf directory. The Language Identifier tool is then used to create a database of url -> language code, which was previously included using the CustomFields job of the fields indexer. The other language analysis plugins from the previous patch acted on the locale or the chosen language on the query side, I think.

> Analysis plugins for multiple language and new Language Identifier Tool
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-666
>                 URL: https://issues.apache.org/jira/browse/NUTCH-666
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.1
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.1
>
>         Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch
>
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.