nutch-dev mailing list archives

From "Marko Bauhardt (JIRA)" <>
Subject [jira] Created: (NUTCH-747) inject&Index metadatas and inherit these metadatas to all matching suburls
Date Thu, 06 Aug 2009 10:36:15 GMT
inject&Index metadatas and inherit these metadatas to all matching suburls

                 Key: NUTCH-747
             Project: Nutch
          Issue Type: Improvement
          Components: indexer, injector
            Reporter: Marko Bauhardt
             Fix For: 1.1

The first of the following two patches supports:
+ injecting metadata for URLs into a metadatadb <TAB> <METAKEY> : <TAB> <METAVALUE> <TAB> <METAVALUE>
+ updating the parse_data metadata of a shard and writing the metadata to all fetched URLs
that start with a URL from the metadatadb
+ in this way, all matching sub-URLs inherit the metadata
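
The inheritance rule above can be illustrated with a small sketch. This is not the code from the patch (which is Java and not shown here); it is a Python illustration under several assumptions: that each metadata line starts with the URL, followed by a tab-separated key ending in ":" and one or more values, that each line carries a single key, and that "matching" means a plain string-prefix test. The function names and the example.com URLs are hypothetical.

```python
def parse_metadata_line(line):
    """Split one assumed metadata line: URL <TAB> KEY: <TAB> VALUE [<TAB> VALUE ...]."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    fields = [f for f in fields if f]  # drop empty fields from repeated tabs
    if len(fields) < 3 or not fields[1].endswith(":"):
        raise ValueError("expected: URL <TAB> KEY: <TAB> VALUE ...")
    url = fields[0]
    key = fields[1][:-1].strip()  # strip the trailing colon
    return url, key, fields[2:]


def inherited_metadata(fetched_url, metadata_db):
    """Collect metadata from every metadatadb URL that is a prefix of fetched_url.

    metadata_db maps a URL prefix to a dict of key -> list of values.
    """
    merged = {}
    for prefix, metadata in metadata_db.items():
        if fetched_url.startswith(prefix):
            for key, values in metadata.items():
                merged.setdefault(key, []).extend(values)
    return merged


if __name__ == "__main__":
    # Hypothetical metadatadb content, built from two assumed input lines.
    db = {}
    for raw in ("http://example.com/docs/\tversion:\t1.0",
                "http://example.com/old/\tversion:\t0.9"):
        url, key, values = parse_metadata_line(raw)
        db.setdefault(url, {})[key] = values
    print(inherited_metadata("http://example.com/docs/intro.html", db))
```

A prefix test keeps the lookup simple, but it also means that a URL matching several metadatadb entries collects the values of all of them; whether the patch behaves the same way is not stated in this message.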

The second patch implements an index-metadata plugin.
+ This plugin extracts metadata from the parse_data of a shard and indexes it. Which metadata
is indexed can be configured in the
+ To index the lang metadata, for example, configure lang=STORE,UNTOKENIZED.
+ This means that the index plugin extracts the metadata values with the key "lang". If any exist, all
values are indexed stored and untokenized.
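
How an entry such as lang=STORE,UNTOKENIZED might be interpreted can be sketched as follows. This is an illustration in Python, not the plugin's code: the function names are hypothetical, and only the two flags named in this message (STORE and UNTOKENIZED) are given a meaning. Anything else the real plugin may accept is unknown here.

```python
def parse_index_option(entry):
    """Parse an assumed 'key=FLAG,FLAG' entry into (key, set of flags)."""
    key, sep, flags = entry.partition("=")
    key = key.strip()
    if not sep or not key:
        raise ValueError("expected an entry of the form key=FLAG,FLAG")
    flag_set = {f.strip().upper() for f in flags.split(",") if f.strip()}
    return key, flag_set


def plan_index_fields(parse_metadata, entries):
    """Describe how each configured key would be indexed, if it has values.

    parse_metadata maps a metadata key to its list of values.
    Keys without values in parse_metadata are skipped.
    """
    plan = []
    for entry in entries:
        key, flags = parse_index_option(entry)
        for value in parse_metadata.get(key, []):
            plan.append({
                "field": key,
                "value": value,
                "stored": "STORE" in flags,
                "tokenized": "UNTOKENIZED" not in flags,
            })
    return plan


if __name__ == "__main__":
    metadata = {"lang": ["de", "en"], "version": ["1.0"]}
    for field in plan_index_fields(metadata, ["lang=STORE,UNTOKENIZED"]):
        print(field)
```

Only keys named in the configuration produce fields, so the "version" metadata in the usage example is ignored unless it is configured as well.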


Create start URLs in "/tmp/urls/start/urls.txt"

Create metadata URLs in "/tmp/urls/metadata/urls.txt"     version:        1.0     version:        0.9

Inject URLs
bin/nutch inject crawldb /tmp/urls/start/
bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb /tmp/urls/metadata/

Fetch & Parse & Update
bin/nutch generate crawldb segments
bin/nutch fetch segments/20090806105717/
bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb segments/20090806105717
bin/nutch updatedb crawldb/ segments/20090806105717/

Fetch & Parse & Update Again

bin/nutch invertlinks linkdb -dir segments/
bin/nutch index index crawldb/ linkdb/ segments/20090806105717 segments/20090806110127

Check your Index
All URLs starting with " " are indexed with "version:1.0".
All URLs starting with " " are indexed with "version:0.9".

This issue is somewhat related to

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
