nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Filter by content language ID
Date Tue, 13 Dec 2011 09:41:04 GMT
As I said, create an indexing filter. The example on the wiki is very simple
and clear. Just check the field created by the langid plugin and decide what
to do with it. When the plugin is present, the field is automatically added
to each NutchDocument, which is passed through the indexing filters and later
transformed into a SolrDocument object.
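
For illustration, here is a minimal sketch of the whitelist decision such a
filter would make. It is deliberately standalone: a plain Map stands in for
NutchDocument so the snippet compiles without Nutch on the classpath. The
class name LanguageWhitelist, the field name "lang", and the handling of
region suffixes are assumptions, not something confirmed in this thread;
check which field name your langid plugin version actually writes. In a real
plugin the same check would sit inside the filter method of a class
implementing org.apache.nutch.indexer.IndexingFilter, where, as far as I
know, returning null drops the document from the index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: keeps only documents whose language is whitelisted. */
final class LanguageWhitelist {

    // Assumption: name of the field written by the language identifier plugin.
    static final String LANG_FIELD = "lang";

    // Languages mentioned in the question (EN, IT, DE, FR).
    static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("en", "it", "de", "fr"));

    /** True if the detected language code is on the whitelist. */
    static boolean isAllowed(String lang) {
        if (lang == null) {
            return false;
        }
        String code = lang.trim().toLowerCase(Locale.ROOT);
        // Assumption: reduce codes such as "en-US" or "en_GB" to "en".
        String primary = code.split("[-_]", 2)[0];
        return ALLOWED.contains(primary);
    }

    /**
     * Stand-in for an indexing filter: returns the document unchanged if its
     * language is allowed, or null to signal that it should be skipped.
     */
    static Map<String, List<String>> filter(Map<String, List<String>> doc) {
        List<String> values = doc.get(LANG_FIELD);
        if (values == null || values.isEmpty()) {
            return null; // no detected language: skip (a policy choice)
        }
        return isAllowed(values.get(0)) ? doc : null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put(LANG_FIELD, Arrays.asList("it"));
        System.out.println("kept: " + (filter(doc) != null));
    }
}
```

Documents without a detected language are skipped here; returning the
document instead would keep them. Because the field is left on every kept
document, it still reaches Solr and can be used as a query filter.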

> Hello,
> 
>    After a lot of searching, I was unable to find up-to-date (Nutch 1.4)
> info about how to use the language ID for filtering. Some of the info is
> very outdated and doesn't work at all with Nutch 1.4.
> 
>    Basically we're testing Nutch for crawling 10M+ web pages, but we want
> to deal only with pages that are in EN, IT, DE, or FR, and skip the others.
> In addition, when indexing with Solr, we need to store the field regarding
> the language id, to use it as a query filter (e.g.: "Only pages in XX
> language that contain Y").
> 
>    We're new to Nutch. This seems to be a very common pattern but, as
> stated, I was unable to find any up-to-date documentation. I think the
> solution may be useful to many.
> 
>    Please, point me to a related resource or hint to solve this task. I'm
> very happy to add this solution to the Wiki if it is possible.
> 
> Thanks,
> Alessio
>  -------- Original Message --------
>  Subject: Re: Filter by content language ID
>  From: Markus Jelsma <markus.jelsma@openindex.io>
>  Date: Fri, December 02, 2011 8:49 am
>  To: user@nutch.apache.org
> 
>  On Friday 02 December 2011 16:23:42 contacts@complexityintelligence.com wrote:
>  > Hello everyone,
>  > 
>  > 
>  > We've a set of urls to crawl, but we're interested in crawling only pages
>  > whose language is in our white list (e.g.: English, Italian, French),
>  > and reject all the others.
>  > 
>  > 
>  > I don't know if Nutch has a built-in support for this, language-detector
>  > seems to be dedicated only to another task.
> 
>  You can use the field value added by the language detector to reject the
> page from being indexed. Create a custom indexing filter, skipping all
> documents you don't need.
> 
>  > Which is the best way to achieve this with Nutch? Some configuration
>  > options, or is it necessary to write a new plug-in? (One that, for
>  > example, downloads the page, detects the content language, and proceeds
>  > if the language is OK, otherwise skipping the page.)
>  > 
>  > 
>  > Thanks,
>  > Alessio
