nutch-user mailing list archives

From "Yossi Tamari" <yossi.tam...@pipl.com>
Subject RE: Internal links appear to be external in Parse. Improvement of the crawling quality
Date Tue, 20 Feb 2018 20:06:43 GMT
Hi Semyon,

Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue?
As far as I can see, the protocol (HTTP/HTTPS) plays no part in deciding whether
two URLs belong to the same domain.
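
For reference, a minimal sketch of what this could look like in conf/nutch-site.xml.
It assumes db.ignore.external.links is already set to true in your setup, since the
mode property only matters when external links are being ignored:

```xml
<!-- Assumption: external links are already being ignored in this crawl. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

<!-- Compare links by domain instead of by host (the default is byHost). -->
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
```

With byDomain, www.wincs.be and wincs.be should both resolve to the domain wincs.be,
so http://wincs.be/lakindustrie.html would be treated as internal.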

	Yossi.

> -----Original Message-----
> From: Semyon Semyonov [mailto:semyon.semyonov@mail.com]
> Sent: 20 February 2018 20:43
> To: user@nutch.apache.org <user@nutch.apache.org>
> Subject: Internal links appear to be external in Parse. Improvement of the
> crawling quality
> 
> Dear All,
> 
> I'm trying to improve the quality of my crawls. For part of my database,
> DB_FETCHED = 1.
> 
> For example, http://www.wincs.be/ is in the seed list.
> 
> The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
> 
> Nutch considers one of the links (http://wincs.be/lakindustrie.html) to be external
> and therefore rejects it.
> 
> 
> If I insert http://wincs.be in the seed file, everything works fine.
> 
> Do you think this is good behavior? I mean, formally they are indeed two different
> domains, but from a user's perspective they are exactly the same.
> 
> And if this is the default behavior, how can I fix it for my case? The same question
> applies to similar switches, such as http -> https.
> 
> Thanks.
> 
> Semyon.

