nutch-dev mailing list archives

From "Markus Jelsma (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-828) Fetch Filter
Date Wed, 02 Nov 2011 08:24:35 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-828:
--------------------------------

         Due Date: 9/Jun/10  (was: 9/Jun/10)
    Fix Version/s: 1.5
    
> Fetch Filter
> ------------
>
>                 Key: NUTCH-828
>                 URL: https://issues.apache.org/jira/browse/NUTCH-828
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch
>
>
> Adds a Nutch extension point for a fetch filter.  The fetch filter allows filtering content
> and parse data/text after it is fetched but before it is written to segments.  The filter
> returns true if the content is to be written and false if it is not.
>
> Some use cases for this filter would be topical search engines that only want to fetch and
> index certain types of content, for example a news-only or sports-only search engine.  In
> these situations the only way to determine whether content belongs to a particular set is to
> fetch the page and then analyze the content.  If the content passes, meaning it belongs to
> the set of, say, sports pages, then we want to include it.  If it doesn't, then we want to
> ignore it, never fetch that same page in the future, and ignore any urls on that page.  If
> content is rejected by a fetch filter, its status is written to the CrawlDb as gone and its
> content is ignored and not written to segments.  This effectively stops crawling along the
> crawl path of that page and the urls from that page.  An example filter, fetch-safe, is
> provided that allows fetching only content that contains none of the words in a list of
> bad words.
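
The accept/reject behaviour described in the issue can be sketched roughly as follows. This is an illustration only, not the code from the attached patches: the `FetchFilter` interface, its `accept` method, the `BadWordFetchFilter` class and the status labels `fetch_success` and `gone` are all hypothetical names chosen for this sketch, and the real extension point in NUTCH-828 may have a different signature.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the extension point; the interface in the patch may differ. */
interface FetchFilter {
    /** Returns true if the content should be written to the segment, false to reject it. */
    boolean accept(String url, String content, String parseText);
}

/** Sketch of a "fetch-safe" style filter: rejects pages whose text contains a listed word. */
class BadWordFetchFilter implements FetchFilter {
    private final List<String> badWords;

    BadWordFetchFilter(List<String> badWords) {
        this.badWords = badWords;
    }

    @Override
    public boolean accept(String url, String content, String parseText) {
        // Case-insensitive substring match against the parsed text (an assumption of this sketch).
        String text = parseText == null ? "" : parseText.toLowerCase(Locale.ROOT);
        for (String word : badWords) {
            if (text.contains(word.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}

class FetchFilterDemo {
    /** Mimics the described outcome: rejected pages are marked gone, accepted ones are kept. */
    static String statusFor(FetchFilter filter, String url, String content, String parseText) {
        return filter.accept(url, content, parseText) ? "fetch_success" : "gone";
    }

    public static void main(String[] args) {
        FetchFilter filter = new BadWordFetchFilter(Arrays.asList("casino", "spam"));
        System.out.println(statusFor(filter, "http://example.com/a", "<html>..</html>",
                "Match report: home team wins"));
        System.out.println(statusFor(filter, "http://example.com/b", "<html>..</html>",
                "Best CASINO bonuses"));
    }
}
```

In this sketch the decision is made once per fetched page, after parsing, so a rejected page can be recorded as gone and its outlinks dropped without anything being written to the segment.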

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
