From: "Marko Bauhardt (JIRA)"
To: nutch-dev@lucene.apache.org
Reply-To: nutch-dev@lucene.apache.org
Date: Thu, 6 Aug 2009 03:38:15 -0700 (PDT)
Subject: [jira] Updated: (NUTCH-747) inject&Index metadatas and inherit these metadatas to all matching suburls
Message-ID: <294494010.1249555095708.JavaMail.jira@brutus>
In-Reply-To: <1256676549.1249554975084.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marko Bauhardt updated NUTCH-747:
---------------------------------

    Attachment: index-metadata.patch
                metadata.patch

> inject&Index metadatas and inherit these metadatas to all matching suburls
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-747
>                 URL: https://issues.apache.org/jira/browse/NUTCH-747
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, injector
>            Reporter: Marko Bauhardt
>             Fix For: 1.1
>
>         Attachments: index-metadata.patch, metadata.patch
>
>
> Hi.
> The following two patches add metadata injection, inheritance and indexing.
> The first patch:
> + injects metadata for URLs into a metadatadb
>   url.com : : ...
>   ...
> + updates the parse_data metadata of a shard and writes the metadata to all fetched URLs that start with a URL from the metadatadb
> + in this way supports inheritance of metadata to all matching suburls
> The second patch implements an index-metadata plugin.
> + This plugin extracts metadata from the parse_data of a shard and indexes it. Which metadata is indexed can be configured in plugin.properties.
> + For example, to index the lang you have to configure plugin.properties with: lang=STORE,UNTOKENIZED
> + This means that the index plugin extracts the metadata values with the key "lang". If any exist, all values are indexed as stored and untokenized.
>
> Example
>
> Create start URLs in "/tmp/urls/start/urls.txt"
>   http://lucene.apache.org/nutch/apidocs-1.0/index.html
>   http://lucene.apache.org/nutch/apidocs-0.9/index.html
>
> Create metadata URLs in "/tmp/urls/metadata/urls.txt"
>   http://lucene.apache.org/nutch/apidocs-1.0/ version: 1.0
>   http://lucene.apache.org/nutch/apidocs-0.9/ version: 0.9
>
> Inject URLs
>   bin/nutch inject crawldb /tmp/urls/start/
>   bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb /tmp/urls/metadata/
>
> Fetch & Parse & Update
>   bin/nutch generate crawldb segments
>   bin/nutch fetch segments/20090806105717/
>   bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb segments/20090806105717
>   bin/nutch updatedb crawldb/ segments/20090806105717/
>
> Fetch & Parse & Update Again
>   ...
>
> Index
>   bin/nutch invertlinks linkdb -dir segments/
>   bin/nutch index index crawldb/ linkdb/ segments/20090806105717 segments/20090806110127
>
> Check your Index
>   All URLs starting with "http://lucene.apache.org/nutch/apidocs-1.0/" are indexed with "version:1.0".
>   All URLs starting with "http://lucene.apache.org/nutch/apidocs-0.9/" are indexed with "version:0.9".
>
> This issue is somewhat related to http://issues.apache.org/jira/browse/NUTCH-655

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
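
The inheritance rule described in the issue (a fetched URL receives the metadata of every metadatadb URL it starts with) can be illustrated with a small standalone sketch. This is not the code from metadata.patch: the class PrefixMetadataSketch and its methods inject and inheritedMetadata are hypothetical names, and an in-memory map stands in for the metadatadb. It only assumes that matching is a plain string prefix test, as the description suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or of the attached patches.
public class PrefixMetadataSketch {

    // In-memory stand-in for the metadatadb: URL prefix -> (key -> value).
    private final Map<String, Map<String, String>> metadataByPrefix = new LinkedHashMap<>();

    // Records one key/value pair for a URL prefix, roughly what the inject step stores.
    public void inject(String urlPrefix, String key, String value) {
        metadataByPrefix.computeIfAbsent(urlPrefix, p -> new LinkedHashMap<>()).put(key, value);
    }

    // Returns the metadata a fetched URL inherits: the union of the entries of
    // every injected prefix that the URL starts with (assumed plain prefix match).
    public Map<String, String> inheritedMetadata(String fetchedUrl) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> entry : metadataByPrefix.entrySet()) {
            if (fetchedUrl.startsWith(entry.getKey())) {
                result.putAll(entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixMetadataSketch db = new PrefixMetadataSketch();
        db.inject("http://lucene.apache.org/nutch/apidocs-1.0/", "version", "1.0");
        db.inject("http://lucene.apache.org/nutch/apidocs-0.9/", "version", "0.9");

        // A page below the 1.0 prefix inherits version=1.0.
        System.out.println(db.inheritedMetadata(
                "http://lucene.apache.org/nutch/apidocs-1.0/index.html")); // prints {version=1.0}
        // A page below neither prefix inherits nothing.
        System.out.println(db.inheritedMetadata(
                "http://lucene.apache.org/nutch/about.html")); // prints {}
    }
}
```

With the two metadata URLs from the example, every page under apidocs-1.0/ ends up with version 1.0 and every page under apidocs-0.9/ with version 0.9, which matches what the "Check your Index" step expects to find.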