nutch-user mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Fetching just some urls outside domain
Date Thu, 01 Dec 2011 20:17:52 GMT
It would be helpful if you could also provide the settings from
nutch-site.xml that restrict Nutch from crawling outside your specified
domain.

At this stage I think that if your restrictions completely prevent Nutch
from following outlinks to other domains, then the use of regex filters is
pointless. This is not what you want to configure. Instead, you want to
allow Nutch to follow outlinks to other domains but limit which domains it
crawls. I think it should be possible to add filters to your regex filter
file like

# accept the following but block everything else

+^http://([a-z0-9]*\.)*somesite\.it/
+^http://([a-z0-9]*\.)*aaa\.it/
+^http://([a-z0-9]*\.)*bbb\.it/
etc

I don't think you will need to explicitly deny everything else. However,
you'll only find out by running a number of small test crawls to check
whether your regex filters are working.
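
Before running test crawls, it may help to check the rules against a few
sample URLs offline. The following is only a rough Python sketch, not
Nutch code: parse_rules and accepts are hypothetical helper names, and it
assumes (as I understand the regex filter to behave) that rules are tried
in order, the first matching rule decides, and a URL matching no rule is
rejected. If your filter file still ends with a catch-all "+." line, that
line would accept everything else, so the result would differ.

```python
import re

# Rules in the style of the regex filter file: '+' accepts, '-' rejects.
# The dots in the domain names are escaped so they match a literal dot.
RULES = r"""
# accept the following but block everything else
+^http://([a-z0-9]*\.)*somesite\.it/
+^http://([a-z0-9]*\.)*aaa\.it/
+^http://([a-z0-9]*\.)*bbb\.it/
"""


def parse_rules(text):
    """Turn filter-file text into a list of (allow, compiled_pattern)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    samples = [
        "http://www.aaa.it/subfolder/albi",
        "http://www.somesite.it/albi/doc.pdf",
        "http://www.example.com/",
    ]
    for url in samples:
        print(url, "->", "fetch" if accepts(rules, url) else "skip")
```

Swapping in your own rules and a handful of real URLs from your seed list
should show quickly whether a rule is too strict or too loose, but a small
test crawl is still the real check.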

HTH

On Thu, Dec 1, 2011 at 8:57 AM, Adriana Farina
<adriana.farina23@gmail.com> wrote:

> Hi!
>
> Thank you for your answer. You're right, maybe an example would explain
> better what I need to do.
>
> I have to perform the following task. I have to explore a specific domain
> (.gov.it) and I have an initial set of seeds, for example www.aaa.it,
> www.bbb.gov.it, www.ccc.it. I configured Nutch so that it doesn't fetch
> pages outside that domain. However, some resources I need to download
> (documents) are stored on web sites that are not inside the domain I'm
> interested in.
> For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where
> www.somesite.it is not inside "my" domain). Nutch will not fetch that page
> since I told it to behave that way, but I need to download documents stored
> on www.somesite.it. So I need Nutch to go outside the domain I specified
> only when it sees the words "albi" or "albo" inside the URL, since those
> words identify the documents I need. How can I do this?
>
> I hope I've been clear. :)
>
>
>
> 2011/11/30 Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
>
> > Hi Adriana,
> >
> > This should be achievable through fine-grained URL filters. It is kind
> > of hard to elaborate on this without you providing some examples of the
> > type of stuff you're trying to do!
> >
> > Lewis
> >
> > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina
> > <adriana.farina23@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I've been using Nutch 1.3 for just a month, so I'm not an expert. I
> > > configured it so that it doesn't fetch pages outside a specific
> > > domain. However, now I need to let it fetch pages outside the domain
> > > I chose, but only for some URLs (not for all the URLs I have to
> > > crawl). How can I do this? Do I have to write a new plugin?
> > >
> > > Thanks.
> > >
> >
> >
> >
> > --
> > *Lewis*
> >
>



-- 
*Lewis*
