nutch-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: Language-focused crawling
Date Sun, 01 Jul 2012 18:41:55 GMT
Hi

First of all, understand that in order to detect a page's language, the page
must be fetched and at least passed to the parser. As you noted, the
language-identifier filter only adds a lang field, and that's it.

You will need to modify it, or write your own filter, so that it discards
pages in unwanted languages (by returning null).
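
A minimal sketch of the discard logic, assuming a document can be modelled as a plain map of field names to values. The class name, the `filter` signature and the language code "de" are hypothetical stand-ins and not the actual Nutch plugin API; only the "lang" field name and the return-null convention come from the description above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a filter; this is NOT the real Nutch interface.
final class LanguageDiscardSketch {

    // Field assumed to be written earlier by the language-identifier plugin.
    static final String LANG_FIELD = "lang";

    /**
     * Returns the document unchanged if its language is wanted,
     * or null to signal that the document should be discarded.
     */
    static Map<String, String> filter(Map<String, String> doc, Set<String> wanted) {
        String lang = doc.get(LANG_FIELD);
        if (lang == null || !wanted.contains(lang)) {
            return null; // unknown or unwanted language: drop the page
        }
        return doc;
    }

    public static void main(String[] args) {
        Set<String> wanted = Collections.singleton("de"); // example language code

        Map<String, String> german = new HashMap<>();
        german.put(LANG_FIELD, "de");
        Map<String, String> english = new HashMap<>();
        english.put(LANG_FIELD, "en");

        System.out.println("german kept:  " + (filter(german, wanted) != null));
        System.out.println("english kept: " + (filter(english, wanted) != null));
    }
}
```

In this sketch a page with no lang field at all is also dropped; whether that is the right default depends on how reliable the language detection is for your target language.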

Scoring filters are something different and are not suitable for this purpose.

As for indexing pages referenced by pages in the desired language, one
solution might be to add a flag to the outlink metadata, which would then be
used to pass the page through your filter.
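
The flag idea could look roughly like the sketch below, assuming outlink metadata can be modelled as a string map per URL. The key name `from.wanted.lang` and both method names are made up for illustration and are not Nutch API; how the metadata actually travels from the parent page to the outlink inside Nutch is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of flag propagation; names are illustrative, not Nutch API.
final class OutlinkFlagSketch {

    // Assumed metadata key; any otherwise unused key would do.
    static final String FLAG = "from.wanted.lang";

    /** Builds per-outlink metadata, setting the flag only if the parent page is in the wanted language. */
    static Map<String, Map<String, String>> tagOutlinks(List<String> outlinks, boolean parentIsWanted) {
        Map<String, Map<String, String>> tagged = new HashMap<>();
        for (String url : outlinks) {
            Map<String, String> meta = new HashMap<>();
            if (parentIsWanted) {
                meta.put(FLAG, "true");
            }
            tagged.put(url, meta);
        }
        return tagged;
    }

    /** A page passes if it is itself in the wanted language or was linked from a wanted page. */
    static boolean accept(boolean pageIsWanted, Map<String, String> meta) {
        return pageIsWanted || "true".equals(meta.get(FLAG));
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("http://a.example/", "http://b.example/");
        Map<String, Map<String, String>> fromWanted = tagOutlinks(links, true);
        Map<String, Map<String, String>> fromOther = tagOutlinks(links, false);

        System.out.println(accept(false, fromWanted.get("http://a.example/")));
        System.out.println(accept(false, fromOther.get("http://a.example/")));
    }
}
```

If a URL is reached from several parents, the flag would need to survive as long as any one of them set it; that merge step is left out here.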

None of this is really difficult if you have the necessary programming skills
and a strong desire. :)


Best Regards
Alexander Aristov


On 1 July 2012 17:00, Safdar Kureishy <safdar.kureishy@gmail.com> wrote:

> Hi,
>
> I would like to do a focused web crawl using Nutch, for all pages of a
> specific language - let's say "lang". However, the default
> language-identifier plugin from Nutch does not support this language.
>
> The heuristic I'd like to use is that I want all pages pointed to by pages
> containing "lang" content to be crawled, but pages that are pointed to by
> non-"lang" pages should not be crawled (unless at least one "lang" page
> points to it). It appears that I would need to create a ScoringFilter for
> this, and exploit the distributeScoreToOutlinks() and updateDbScore()
> methods of the filter. However, before I embark on that journey, I thought
> I'd ask if there is already a solution to this problem of a language
> focused crawl in any Nutch plugin library somewhere, that supports an
> extensive list of languages?
>
> Thanks,
> Safdar
>
