nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Filter by content language ID
Date Fri, 02 Dec 2011 15:49:51 GMT


On Friday 02 December 2011 16:23:42 contacts@complexityintelligence.com wrote:
> Hello everyone,
> 
> 
>    We have a set of URLs to crawl, but we're interested in crawling only
> pages whose language is in our whitelist (e.g. English, Italian, French),
> rejecting all the others.
> 
> 
>    I don't know if Nutch has built-in support for this; the
> language detector seems to be dedicated to a different task.
> 
You can use the field value that the language detector adds to each document
to keep unwanted pages out of the index. Write a custom indexing filter that
skips every document whose language you don't need.
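
As a rough illustration of that idea, here is a minimal sketch of the decision
such a filter would make. It is not Nutch code: a plain Map stands in for the
index document so the example compiles on its own, and the class name
LanguageWhitelistFilter is hypothetical. It assumes the detected language is
stored in a field named "lang", and that a real Nutch 1.x IndexingFilter drops
a document by returning null from its filter method; please check both against
the Nutch version you run.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/**
 * Stand-in for the decision a custom indexing filter would make.
 * A plain Map plays the role of the index document (hypothetical model,
 * not the Nutch API).
 */
public class LanguageWhitelistFilter {

    // Assumption: the language detector writes its result to a field named "lang".
    static final String LANG_FIELD = "lang";

    private final Set<String> allowed;

    public LanguageWhitelistFilter(Set<String> allowed) {
        this.allowed = allowed;
    }

    /** Returns the document unchanged if its language is allowed, otherwise null. */
    public Map<String, String> filter(Map<String, String> doc) {
        String lang = doc.get(LANG_FIELD);
        if (lang == null) {
            // No detected language: this sketch chooses to skip the document.
            return null;
        }
        return allowed.contains(lang.toLowerCase(Locale.ROOT)) ? doc : null;
    }

    public static void main(String[] args) {
        LanguageWhitelistFilter f =
            new LanguageWhitelistFilter(Set.of("en", "it", "fr"));

        Map<String, String> english = new HashMap<>();
        english.put(LANG_FIELD, "en");
        Map<String, String> german = new HashMap<>();
        german.put(LANG_FIELD, "de");

        System.out.println(f.filter(english) != null); // prints "true"
        System.out.println(f.filter(german) != null);  // prints "false"
    }
}
```

In a real plug-in the same check would live inside the indexing filter's
filter method, reading the language field from the document it is handed.
Note that this only keeps pages out of the index; they are still fetched and
parsed.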

> 
>    What is the best way to achieve this with Nutch? Are there configuration
> options, or do we need to write a new plug-in (one that, for example,
> downloads the page, detects the content language, and proceeds if the
> language is acceptable, skipping the page otherwise)?
> 
> 
> Thanks,
> Alessio

-- 
Markus Jelsma - CTO - Openindex
