nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-655) Injecting Crawl metadata
Date Fri, 11 Dec 2009 10:33:18 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-655:
--------------------------------

    Attachment: NUTCH-655.v2

Improved version of the patch, which allows custom scores to be specified for the URLs. A score
is specified by setting a float value in place of a name=value pair, e.g.:
http://www.lemonde.fr/    label=newspaper  10.0
http://www.lequipe.fr/    label=sports  2.0

> Injecting Crawl metadata
> ------------------------
>
>                 Key: NUTCH-655
>                 URL: https://issues.apache.org/jira/browse/NUTCH-655
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: Injector.patch, NUTCH-655.v2
>
>
> The attached patch allows metadata to be injected into the crawlDB. The input file must
> contain fields separated by tabs, with the URL in the first column. Metadata names
> and values are separated by '='. An input line might look like this:
> http://www.myurl.com  \t  categ=value1 \t categ2=value2
> This functionality can be useful for storing external knowledge and indexing it with a custom
> plugin.
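
For illustration only, here is a minimal Python sketch of how a seed line in the format described above could be parsed. It is not the code from the attached patch (which targets the Nutch injector). The function name parse_seed_line is hypothetical, and the handling of surrounding whitespace and of fields that are neither name=value pairs nor floats is an assumption.

```python
from typing import Dict, Optional, Tuple


def parse_seed_line(line: str) -> Tuple[str, Dict[str, str], Optional[float]]:
    """Split one seed line into (url, metadata, score).

    Hypothetical helper: fields are tab-separated, the URL is the first
    field, name=value fields become metadata, and a bare float is the score.
    """
    # Assumption: whitespace around each tab-separated field is not significant.
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    url = fields[0]
    metadata: Dict[str, str] = {}
    score: Optional[float] = None

    for field in fields[1:]:
        if not field:
            continue  # skip empty fields, e.g. from consecutive tabs
        if "=" in field:
            # Split on the first '=' only, so values may contain '='.
            name, value = field.split("=", 1)
            metadata[name.strip()] = value.strip()
        else:
            try:
                score = float(field)
            except ValueError:
                # Assumption: silently ignore fields that are neither
                # name=value pairs nor floats.
                pass

    return url, metadata, score


if __name__ == "__main__":
    print(parse_seed_line("http://www.lemonde.fr/\tlabel=newspaper\t10.0"))
```

A line without a float field yields a score of None, leaving the choice of a default score to the caller.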

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

