nutch-user mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Fetching just some urls outside domain
Date Thu, 01 Dec 2011 22:59:16 GMT
Nutch comes packed with quite a few url-filters out of the box. They just
need some tuning.

Have a look in NUTCH_HOME/conf.

Also have a look at the corresponding plugins. You should really start a
new thread for new questions :0)

I think you're looking for the urlfilter-domain plugin.
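
For the regex question, the file that goes with the urlfilter-regex plugin is conf/regex-urlfilter.txt. The sketch below is not Nutch code; it is a small Python stand-in that mimics how that file is read ('+' accepts, '-' rejects, the first matching pattern wins, and a URL that matches nothing is dropped), so candidate rules can be tried out before a crawl. The three rules, the example hosts, and the helper names parse_rules and accepts are made up for illustration, and "ending with .info" is assumed to mean a host under the .info TLD. Nutch itself uses Java regular expressions; these simple patterns should behave the same way there, but that is worth checking against a real crawl.

```python
import re

# Hypothetical rules written in the style of conf/regex-urlfilter.txt.
# Order matters: the first pattern that matches decides the outcome.
RULES = r"""
# reject any URL whose host is under .info (assumed meaning of "ending with .info")
-^https?://[^/?#]+\.info(:\d+)?([/?#]|$)
# accept any URL that mentions albi or albo, whatever the host
+alb[io]
# accept hosts inside the crawl's own domain
+^https?://([a-z0-9-]+\.)*gov\.it(:\d+)?([/?#]|$)
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):  # unanchored search, not a full match
            return allow
    return False  # no rule matched, so the URL is dropped


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.bbb.gov.it/index.html",
        "http://www.somesite.it/albi/doc.pdf",
        "http://www.somesite.it/home",
        "http://example.info/albo.html",
    ):
        print(url, "->", "fetch" if accepts(url, rules) else "skip")
```

Rules like these only see the URL string, so a redirect target that does not itself contain "albi" or "albo" would still be skipped. Putting the .info rule first means it also overrides the albi/albo rule for .info hosts; swap the two lines if the opposite is wanted.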

On Thu, Dec 1, 2011 at 10:48 PM, <alxsss@aim.com> wrote:

> Hello,
>
> I would like to know how one can put a filter on outlinks. I mean, if
> I have a regex, which file should I put it in?
> For example, I want nutch to ignore outlinks ending with .info.
>
> Thanks.
> Alex.
>
> -----Original Message-----
> From: Arkadi.Kosmynin <Arkadi.Kosmynin@csiro.au>
> To: user <user@nutch.apache.org>
> Sent: Thu, Dec 1, 2011 1:44 pm
> Subject: RE: Fetching just some urls outside domain
>
>
> Hi Adriana,
>
> You can try Arch for this:
>
> http://www.atnf.csiro.au/computing/software/arch
>
> You can configure it to crawl your web sites plus sets of miscellaneous URLs
> called "bookmarks" in Arch. Arch is a free extension of Nutch. Right now, only
> Arch based on Nutch 1.2 is available for downloading. We are about to release
> Arch based on Nutch 1.4.
>
> Regards,
>
> Arkadi
>
>
>
> > -----Original Message-----
> > From: Adriana Farina [mailto:adriana.farina23@gmail.com]
> > Sent: Thursday, 1 December 2011 7:58 PM
> > To: user@nutch.apache.org
> > Subject: Re: Fetching just some urls outside domain
> >
> > Hi!
> >
> > Thank you for your answer. You're right, maybe an example would explain
> > better what I need to do.
> >
> > I have to perform the following task. I have to explore a specific domain
> > (.gov.it) and I have an initial set of seeds, for example www.aaa.it,
> > www.bbb.gov.it, www.ccc.it. I configured nutch so that it doesn't fetch
> > pages outside that domain. However some resources I need to download
> > (documents) are stored on web sites that are not inside the domain I'm
> > interested in.
> > For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where
> > www.somesite.it is not inside "my" domain). Nutch will not fetch that page
> > since I told it to behave that way, but I need to download documents stored
> > on www.somesite.it. So I need nutch to go outside the domain I specified
> > only when it sees the words "albi" or "albo" inside the url, since those
> > words identify the documents I need. How can I do this?
> >
> > I hope I've been clear. :)
> >
> >
> >
> > 2011/11/30 Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
> >
> > > Hi Adriana,
> > >
> > > This should be achievable through fine grained URL filters. It is kind of
> > > hard to elaborate on this without you providing some examples of the
> > > type of stuff you're trying to do!
> > >
> > > Lewis
> > >
> > > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina <adriana.farina23@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I've been using nutch 1.3 for just a month, so I'm not an expert. I
> > > > configured it so that it doesn't fetch pages outside a specific domain.
> > > > However, now I need to let it fetch pages outside the domain I chose, but
> > > > only for some urls (not for all the urls I have to crawl). How can I do
> > > > this? Do I have to write a new plugin?
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
>
>
>


-- 
*Lewis*
