nutch-user mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: Internal links appear to be external in Parse. Improvement of the crawling quality
Date Wed, 21 Feb 2018 10:51:16 GMT
Hi Semyon,

> interpret www.somewebsite.com and somewebsite.com as one host?

Yes, that's a common problem. It mostly shows up through links from other
sites, which must include the host name. Well-designed sites use relative
links for internal same-host links.

For a quick work-around:
- set db.ignore.external.links.mode=byDomain
- modify the method URLUtil.getDomainName(URL url)
  so that it returns the hostname with www. stripped
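
To illustrate the second step, here is a minimal sketch of the stripping
logic only. The class WwwStrippedHost and its methods are hypothetical
names, not existing Nutch code; a patched URLUtil.getDomainName(URL url)
could return the same value. It assumes that only a single leading "www."
label should be dropped:

```java
import java.net.URL;
import java.util.Locale;

/** Hypothetical helper (not part of Nutch): host name with a leading "www." removed. */
final class WwwStrippedHost {

  private static final String WWW_PREFIX = "www.";

  private WwwStrippedHost() {}

  /** Lower-cases the host and strips one leading "www." if anything remains after it. */
  static String strip(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    if (h.startsWith(WWW_PREFIX) && h.length() > WWW_PREFIX.length()) {
      return h.substring(WWW_PREFIX.length());
    }
    return h;
  }

  /** Convenience overload for the URL-based signature used by getDomainName. */
  static String of(URL url) {
    return strip(url.getHost());
  }
}
```

With this, www.wincs.be and wincs.be map to the same value, while
www.art.somewebsite.com becomes art.somewebsite.com and so remains
distinct from somewebsite.com.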

For a permanent solution, we could make the method or class that is
called configurable. Since the definition of "domain" is somewhat
debatable [1], we could even provide alternative implementations.
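
As a rough sketch of what "configurable" could look like (purely
hypothetical: neither the class HostGrouping nor the strategy names
"exactHost" and "hostWithoutWww" exist in Nutch), the implementation
could be looked up by a name taken from the configuration:

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical registry of alternative "same site" definitions, selected by name. */
final class HostGrouping {

  // Each strategy maps a host name to the key used to decide "internal vs. external".
  static final Map<String, UnaryOperator<String>> STRATEGIES = Map.of(
      "exactHost", host -> host.toLowerCase(Locale.ROOT),
      "hostWithoutWww", host -> {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") && h.length() > 4 ? h.substring(4) : h;
      });

  private HostGrouping() {}

  /** Returns the grouping key of the host under the named strategy. */
  static String key(String strategyName, String host) {
    UnaryOperator<String> strategy = STRATEGIES.get(strategyName);
    if (strategy == null) {
      throw new IllegalArgumentException("Unknown strategy: " + strategyName);
    }
    return strategy.apply(host);
  }
}
```

Two URLs would then count as internal to each other when their keys
are equal under the configured strategy.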

> PS. For me it is not really clear how ProtocolResolver works.

It's only a heuristic to avoid duplicates by protocol (http and https).
If you care about duplicates and cannot get rid of them afterwards with a deduplication job,
you may have a look at urlnormalizer-protocol and NUTCH-2447.

Best,
Sebastian


[1] https://github.com/google/guava/wiki/InternetDomainNameExplained

On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
> Thanks Yossi, Markus,
> 
> I have an issue with the db.ignore.external.links.mode=byDomain solution.
> 
> I crawl specific hosts only, so I have a finite number of hosts to crawl.
> Let's say www.somewebsite.com.
> 
> I want to stay limited to this host, in other words to crawl neither www.art.somewebsite.com
> nor www.sport.somewebsite.com.
> That's why I use db.ignore.external.links.mode=byHost and db.ignore.external.links=true (no external
> websites).
> 
> However, I do want to get the links that seem to belong to the same host (www.somewebsite.com
> -> somewebsite.com/games, without www).
> The question is: shouldn't we include this as default (or configurable) behavior
> in Nutch and interpret www.somewebsite.com and somewebsite.com as one host?
> 
> 
> 
> PS. For me it is not really clear how ProtocolResolver works.
> 
> Semyon
> 
> 
>  
> 
> Sent: Tuesday, February 20, 2018 at 9:40 PM
> From: "Markus Jelsma" <markus.jelsma@openindex.io>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
> Hello Semyon,
> 
> Yossi is right, you can use the db.ignore.* set of directives to resolve the problem.
> 
> Regarding the protocol, you can use urlnormalizer-protocol to set up per-host rules. This
> is, of course, a tedious job if you crawl an indefinite number of hosts, so use
> the uncommitted ProtocolResolver to do it for you.
> 
> See: https://issues.apache.org/jira/browse/NUTCH-2247
> 
> If I remember tomorrow afternoon, I can probably schedule some time to work on it
> in the coming seven days or so, and commit it.
> 
> Regards,
> Markus
> 
> -----Original message-----
>> From:Yossi Tamari <yossi.tamari@pipl.com>
>> Sent: Tuesday 20th February 2018 21:06
>> To: user@nutch.apache.org
>> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
>>
>> Hi Semyon,
>>
>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue?
>> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in deciding
>> whether this is the same domain.
>>
>> Yossi.
>>
>>> -----Original Message-----
>>> From: Semyon Semyonov [mailto:semyon.semyonov@mail.com]
>>> Sent: 20 February 2018 20:43
>>> To: usernutch.apache.org <user@nutch.apache.org>
>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>> crawling quality
>>>
>>> Dear All,
>>>
>>> I'm trying to increase the quality of the crawling. A part of my database has
>>> DB_FETCHED = 1.
>>>
>>> For example, http://www.wincs.be/ is in the seed list.
>>>
>>> The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
>>>
>>> Nutch considers one of the links (http://wincs.be/lakindustrie.html)
>>> as external and therefore rejects it.
>>>
>>>
>>> If I insert http://wincs.be in the seed file, everything works fine.
>>>
>>> Do you think this is good behavior? I mean, formally these are indeed two different
>>> domains, but from the user's perspective they are exactly the same.
>>>
>>> And if it is the default behavior, how can I fix it for my case? The same question
>>> applies to the similar switch http -> https, etc.
>>>
>>> Thanks.
>>>
>>> Semyon.
>>
>>

