nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Language-focused crawling
Date Sun, 01 Jul 2012 18:45:48 GMT
It's a use case for a fetch filter:
https://issues.apache.org/jira/browse/NUTCH-828

 
 
-----Original message-----
> From:Alexander Aristov <alexander.aristov@gmail.com>
> Sent: Sun 01-Jul-2012 20:43
> To: user@nutch.apache.org; safdar.kureishy@gmail.com
> Subject: Re: Language-focused crawling
> 
> Hi
> 
> First of all, keep in mind that in order to detect a page's language, the page
> must be crawled and at least sent to the parser. As you noted, the
> language-identifier filter only adds a lang field and does nothing else.
> 
> You will need to modify it or write your own filter that discards pages in
> unwanted languages (by returning null).
> 
> Scoring filters are something different and are not suitable for this purpose.
> 
> As for indexing pages referenced by pages in the desired language, one solution
> might be to add a flag to the outlink metadata, which would then be used to pass
> the page through your filter.
> 
> None of this is really difficult if you have the necessary programming skills
> and a strong desire. :)
> 
> 
> Best Regards
> Alexander Aristov
> 
> 
> On 1 July 2012 17:00, Safdar Kureishy <safdar.kureishy@gmail.com> wrote:
> 
> > Hi,
> >
> > I would like to do a focused web crawl using Nutch, for all pages of a
> > specific language - let's say "lang". However, the default
> > language-identifier plugin from Nutch does not support this language.
> >
> > The heuristic I'd like to use is that I want all pages pointed to by pages
> > containing "lang" content to be crawled, but pages that are pointed to by
> > non-"lang" pages should not be crawled (unless at least one "lang" page
> > points to them). It appears that I would need to create a ScoringFilter for
> > this, and exploit the distributeScoreToOutlinks() and updateDbScore()
> > methods of the filter. However, before I embark on that journey, I thought
> > I'd ask if there is already a solution to this problem of a language
> > focused crawl in any Nutch plugin library somewhere, that supports an
> > extensive list of languages?
> >
> > Thanks,
> > Safdar
> >
> 
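
The two ideas from Alexander's reply (a filter that returns null for unwanted languages, and a flag carried in outlink metadata) can be sketched roughly as below. This is only an illustration in plain Java, not Nutch code: the Page class, the FLAG key, the "xx" language code, and the filter and tagOutlinks methods are all hypothetical stand-ins, and a real plugin would have to be written against the actual Nutch extension points.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: these types are stand-ins, not Nutch classes.
final class LangFocusSketch {

    /** Hypothetical metadata key marking a link found on a wanted-language page. */
    static final String FLAG = "from.wanted.lang";

    /** Minimal stand-in for a parsed page: URL, detected language, inherited metadata. */
    static final class Page {
        final String url;
        final String lang;               // may be null if detection failed
        final Map<String, String> meta;  // metadata inherited from the inlink

        Page(String url, String lang, Map<String, String> meta) {
            this.url = url;
            this.lang = lang;
            this.meta = meta;
        }
    }

    /** True if the page's detected language is one of the wanted ones. */
    static boolean isWanted(Page page, Set<String> wanted) {
        return page.lang != null && wanted.contains(page.lang);
    }

    /** Returns the page to keep it, or null to discard it. */
    static Page filter(Page page, Set<String> wanted) {
        boolean flagged = "true".equals(page.meta.get(FLAG));
        return (isWanted(page, wanted) || flagged) ? page : null;
    }

    /** Builds per-outlink metadata; only wanted-language pages pass the flag on. */
    static Map<String, Map<String, String>> tagOutlinks(
            Page page, List<String> outlinks, Set<String> wanted) {
        Map<String, Map<String, String>> tagged = new HashMap<>();
        for (String link : outlinks) {
            Map<String, String> meta = new HashMap<>();
            if (isWanted(page, wanted)) {
                meta.put(FLAG, "true");
            }
            tagged.put(link, meta);
        }
        return tagged;
    }

    public static void main(String[] args) {
        Set<String> wanted = Collections.singleton("xx"); // placeholder language code
        Page source = new Page("http://a.example/", "xx", new HashMap<>());
        Map<String, Map<String, String>> out =
                tagOutlinks(source, Arrays.asList("http://b.example/"), wanted);

        // The linked page is in another language but inherits the flag.
        Page linked = new Page("http://b.example/", "en", out.get("http://b.example/"));
        System.out.println("linked page kept: " + (filter(linked, wanted) != null));

        // A page in another language with no flag is discarded.
        Page stray = new Page("http://c.example/", "en", new HashMap<>());
        System.out.println("stray page kept: " + (filter(stray, wanted) != null));
    }
}
```

Note that in this sketch the flag is not passed on by a flagged page that is itself in another language, so only links found directly on wanted-language pages are let through, which matches the heuristic in the original question.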
