nutch-user mailing list archives

From "Semyon Semyonov" <semyon.semyo...@mail.com>
Subject Re: Internal links appear to be external in Parse. Improvement of the crawling quality
Date Wed, 21 Feb 2018 12:52:53 GMT
Hi Sebastian,

If I
- modify the method URLUtil.getDomainName(URL url)

doesn't that mean I no longer need to
- set db.ignore.external.links.mode=byDomain

? http://www.somewebsite.com would then become the same host as somewebsite.com.


To make it as generic as possible I can create an issue/pull request for this, but I would
like to hear your suggestion about the best way to do so.
1) Do we have a config setting that we can use already?
2) The domain discussion [1] is quite wide, though. In my case I cover only one issue, the
mapping www -> _ . It looks more like a same-host problem than a same-domain problem.
What do you think about such host resolution?
3) Where should this problem be solved? Only in ParseOutputFormat.java, or somewhere else as
well?

Semyon.


 

Sent: Wednesday, February 21, 2018 at 11:51 AM
From: "Sebastian Nagel" <wastl.nagel@googlemail.com>
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality
Hi Semyon,

> interpret www.somewebsite.com and somewebsite.com as one host?

Yes, that's a common problem, mostly because of external links, which must
include the host name - well-designed sites would use relative links
for internal same-host links.

For a quick work-around:
- set db.ignore.external.links.mode=byDomain
- modify the method URLUtil.getDomainName(URL url)
so that it returns the hostname with www. stripped
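
For reference, the first step is a standard property that can be overridden in nutch-site.xml; a minimal fragment (values as documented in nutch-default.xml) would be:

```xml
<!-- nutch-site.xml: keep only outlinks whose domain (as returned by
     URLUtil.getDomainName) matches the source page's domain -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
```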

For a final solution we could make it configurable
which method or class is called. Since the definition of "domain"
is somewhat debatable [1], we could even provide alternative
implementations.
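
As a standalone sketch of the www-stripping idea (illustrative only - the class and method names here are hypothetical, and the real org.apache.nutch.util.URLUtil.getDomainName is more involved, consulting a domain-suffix list):

```java
// Sketch of a host-normalization helper: strip a leading "www."
// so that www.example.com and example.com compare as the same host.
// Hypothetical example - not the actual Nutch URLUtil implementation.
import java.net.MalformedURLException;
import java.net.URL;

public class HostNormalizer {

    /** Returns the URL's host, lower-cased, with a leading "www." removed. */
    public static String normalizedHost(URL url) {
        String host = url.getHost().toLowerCase();
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    /** True if both URLs point at the same normalized host. */
    public static boolean sameHost(URL a, URL b) {
        return normalizedHost(a).equals(normalizedHost(b));
    }

    public static void main(String[] args) throws MalformedURLException {
        URL seed = new URL("http://www.somewebsite.com/");
        URL outlink = new URL("http://somewebsite.com/games");
        System.out.println(sameHost(seed, outlink)); // prints "true"
    }
}
```

Note that subdomains such as art.somewebsite.com still compare as a different host under this rule, which matches the byHost intent described later in the thread.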

> PS. For me it is not really clear how ProtocolResolver works.

It's only a heuristic to avoid duplicates by protocol (http and https).
If you care about duplicates and cannot get rid of them afterwards by a deduplication job,
you may have a look at urlnormalizer-protocol and NUTCH-2447.
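
The idea behind per-host protocol normalization can be sketched as follows (a hypothetical standalone example; the actual urlnormalizer-protocol plugin and NUTCH-2447 have their own configuration format):

```java
// Hypothetical sketch of per-host protocol normalization: rewrite
// http:// to https:// for hosts known to serve https, so the two
// protocol variants no longer produce duplicate URLs.
// Not the actual plugin code; the host list here is an assumed config.
import java.util.Map;

public class ProtocolNormalizerSketch {

    // Hosts for which we force a canonical protocol (assumed configuration).
    private static final Map<String, String> CANONICAL =
            Map.of("somewebsite.com", "https",
                   "www.somewebsite.com", "https");

    public static String normalize(String url) {
        int sep = url.indexOf("://");
        if (sep < 0) return url;
        String rest = url.substring(sep + 3);
        int slash = rest.indexOf('/');
        String host = slash < 0 ? rest : rest.substring(0, slash);
        String proto = CANONICAL.get(host);
        return proto == null ? url : proto + "://" + rest;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://somewebsite.com/games"));
        // prints "https://somewebsite.com/games"
    }
}
```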

Best,
Sebastian


[1] https://github.com/google/guava/wiki/InternetDomainNameExplained

On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
> Thanks Yossi, Markus,
>
> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>
> I crawl specific hosts only therefore I have a finite number of hosts to crawl.
> Let's say, www.somewebsite.com
>
> I want to stay limited to this host. In other words, neither www.art.somewebsite.com
nor www.sport.somewebsite.com.
> That's why db.ignore.external.links.mode=byHost and db.ignore.external=true (no external
websites).
>
> Although, I want to get the links that seem to belong to the same host (www.somewebsite.com
-> somewebsite.com/games, without www).
> The question is: shouldn't we include this as a default (or configurable) behavior
in Nutch and interpret www.somewebsite.com and somewebsite.com
as one host?
>
>
>
> PS. For me it is not really clear how ProtocolResolver works.
>
> Semyon
>
>
>  
>
> Sent: Tuesday, February 20, 2018 at 9:40 PM
> From: "Markus Jelsma" <markus.jelsma@openindex.io>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling
quality
> Hello Semyon,
>
> Yossi is right, you can use the db.ignore.* set of directives to resolve the problem.
>
> Regarding protocol, you can use urlnormalizer-protocol to set up per-host rules. This
is, of course, a tedious job if you operate a crawl on an indefinite number of hosts, so use
the uncommitted ProtocolResolver to do it for you.
>
> See: https://issues.apache.org/jira/browse/NUTCH-2247
>
> If I remember it tomorrow afternoon, I can probably schedule some time to work on it
in the coming seven days or so, and commit.
>
> Regards,
> Markus
>
> -----Original message-----
>> From:Yossi Tamari <yossi.tamari@pipl.com>
>> Sent: Tuesday 20th February 2018 21:06
>> To: user@nutch.apache.org
>> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling
quality
>>
>> Hi Semyon,
>>
>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue?
>> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in the decision
whether this is the same domain.
>>
>> Yossi.
>>
>>> -----Original Message-----
>>> From: Semyon Semyonov [mailto:semyon.semyonov@mail.com]
>>> Sent: 20 February 2018 20:43
>>> To: user@nutch.apache.org
>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>> crawling quality
>>>
>>> Dear All,
>>>
>>> I'm trying to increase quality of the crawling. A part of my database has
>>> DB_FETCHED = 1.
>>>
>>> Example: http://www.wincs.be/ in the seed list.
>>>
>>> The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
>>>
>>> Nutch considers one of the links (http://wincs.be/lakindustrie.html) as external
>>> and therefore rejects it.
>>>
>>>
>>> If I insert http://wincs.be in the seed file, everything works fine.
>>>
>>> Do you think it is a good behavior? I mean, formally it is indeed two different
>>> domains, but from user perspective it is exactly the same.
>>>
>>> And if it is the default behavior, how can I fix it for my case? The same question
>>> for a similar switch, http -> https, etc.
>>>
>>> Thanks.
>>>
>>> Semyon.
>>
>>
 
