nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Internal links appear to be external in Parse. Improvement of the crawling quality
Date Tue, 20 Feb 2018 20:40:51 GMT
Hello Semyon,
 
Yossi is right, you can use the db.ignore.* set of directives to resolve the problem.
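
To illustrate why http://wincs.be/lakindustrie.html is rejected when the seed is http://www.wincs.be/, here is a minimal Python sketch of the host versus domain comparison. It is not Nutch code: the function names are hypothetical, and the "last two labels" domain rule is a simplifying assumption that breaks on suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def naive_domain_of(url):
    """Assumption: the registered domain is the last two host labels.

    This is only good enough for the wincs.be example; it is wrong for
    multi-part public suffixes such as co.uk.
    """
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


def is_external(from_url, to_url, mode="byHost"):
    """Hypothetical stand-in for the external-link check.

    mode mirrors the two values of db.ignore.external.links.mode.
    The URL scheme (http/https) is deliberately not compared.
    """
    if mode == "byDomain":
        return naive_domain_of(from_url) != naive_domain_of(to_url)
    return host_of(from_url) != host_of(to_url)


seed = "http://www.wincs.be/"
link = "http://wincs.be/lakindustrie.html"
print(is_external(seed, link, mode="byHost"))
print(is_external(seed, link, mode="byDomain"))
```

Under these assumptions the link counts as external when hosts are compared, but as internal when domains are compared, which matches the behavior described in this thread.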

Regarding protocol, you can use urlnormalizer-protocol to set up per-host rules. This is,
of course, a tedious job if you operate a crawl on an indefinite number of hosts, so use the
uncommitted ProtocolResolver to do it for you.

See: https://issues.apache.org/jira/browse/NUTCH-2247
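
The idea of per-host protocol rules can be sketched roughly as follows. This is a Python illustration only, not the urlnormalizer-protocol plugin or its configuration format: the rule table, its single entry, and the function name are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical per-host rules: host name -> scheme every URL on it should use.
HOST_PROTOCOLS = {
    "wincs.be": "https",
}


def normalize_protocol(url, rules=HOST_PROTOCOLS):
    """Rewrite the scheme of url if its host has a rule; otherwise return it unchanged.

    Only the scheme is replaced. An explicit port in the URL is kept as-is,
    which a real normalizer would probably need to handle.
    """
    parts = urlsplit(url)
    wanted = rules.get((parts.hostname or "").lower())
    if wanted is None or parts.scheme == wanted:
        return url
    return urlunsplit(parts._replace(scheme=wanted))


print(normalize_protocol("http://wincs.be/lakindustrie.html"))
print(normalize_protocol("http://example.org/index.html"))
```

Maintaining such a table by hand is the tedious part for a crawl over many hosts, which is why resolving the protocol automatically is attractive.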

If I remember it tomorrow afternoon, I can probably schedule some time to work on it in the
coming seven days or so, and commit.

Regards,
Markus
 
-----Original message-----
> From:Yossi Tamari <yossi.tamari@pipl.com>
> Sent: Tuesday 20th February 2018 21:06
> To: user@nutch.apache.org
> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
> 
> Hi Semyon,
> 
> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue?
> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the decision
> if this is the same domain.
> 
> 	Yossi.
> 
> > -----Original Message-----
> > From: Semyon Semyonov [mailto:semyon.semyonov@mail.com]
> > Sent: 20 February 2018 20:43
> > To: user@nutch.apache.org <user@nutch.apache.org>
> > Subject: Internal links appear to be external in Parse. Improvement of the
> > crawling quality
> > 
> > Dear All,
> > 
> > I'm trying to increase the quality of the crawling. A part of my database has
> > DB_FETCHED = 1.
> > 
> > For example, http://www.wincs.be/ is in the seed list.
> > 
> > The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
> > 
> > Nutch considers one of the links (http://wincs.be/lakindustrie.html) to be external
> > and therefore rejects it.
> > 
> > 
> > If I insert http://wincs.be in seed file, everything works fine.
> > 
> > Do you think this is good behavior? I mean, formally these are indeed two different
> > domains, but from the user's perspective they are exactly the same.
> > 
> > And if it is the default behavior, how can I fix it for my case? The same question
> > applies to similar switches such as http -> https, etc.
> > 
> > Thanks.
> > 
> > Semyon.
> 
> 
