Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
From: Markus Jelsma
To: user@nutch.apache.org
Date: Tue, 20 Feb 2018 20:40:51 +0000

Hello Semyon,

Yossi is right, you can use the db.ignore.* set of directives to resolve the problem.

Regarding protocol, you can use urlnormalizer-protocol to set up per-host rules. This is, of course, a tedious job if you operate a crawl on an indefinite number of hosts, so use the uncommitted ProtocolResolver to do it for you.

See: https://issues.apache.org/jira/browse/NUTCH-2247

If I remember it tomorrow afternoon, I can probably schedule some time to work on it in the coming seven days or so, and commit.

Regards,
Markus

-----Original message-----
> From: Yossi Tamari
> Sent: Tuesday 20th February 2018 21:06
> To: user@nutch.apache.org
> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
>
> Hi Semyon,
>
> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue?
> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the decision if this is the same domain.
>
> 	Yossi.
>
> > -----Original Message-----
> > From: Semyon Semyonov [mailto:semyon.semyonov@mail.com]
> > Sent: 20 February 2018 20:43
> > To: user@nutch.apache.org
> > Subject: Internal links appear to be external in Parse. Improvement of the
> > crawling quality
> >
> > Dear All,
> >
> > I'm trying to increase quality of the crawling. A part of my database has
> > DB_FETCHED = 1.
> >
> > Example, http://www.wincs.be/ in seed list.
> >
> > The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
> >
> > Nutch considers one of the link(http://wincs.be/lakindustrie.html) as external
> > and therefore reject it.
> >
> >
> > If I insert http://wincs.be in seed file, everything works fine.
> >
> > Do you think it is a good behavior? I mean, formally it is indeed two different
> > domains, but from user perspective it is exactly the same.
> >
> > And if it is a default behavior, how can I fix it for my case? The same question for
> > similar switch http -> https etc.
> >
> > Thanks.
> >
> > Semyon.
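
For readers following the thread, the difference between comparing links by host and by domain (the two values of db.ignore.external.links.mode discussed above) can be sketched roughly as follows. This is only an illustration, not Nutch's code: the class LinkScopeSketch and its helpers are hypothetical, and the "last two labels" rule is an assumed shortcut that breaks for suffixes such as co.uk, where a proper implementation would consult a public-suffix list.

```java
import java.net.URI;

// Hypothetical illustration; none of these names come from Nutch.
class LinkScopeSketch {

    // Extract the lower-cased host part of a URL; the scheme is ignored.
    static String host(String url) {
        return URI.create(url).getHost().toLowerCase();
    }

    // Assumed shortcut: treat the last two labels of the host as its domain.
    // Incorrect for multi-label suffixes such as co.uk.
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Host-level comparison: www.wincs.be and wincs.be are different hosts.
    static boolean sameHost(String from, String to) {
        return host(from).equals(host(to));
    }

    // Domain-level comparison: both hosts reduce to wincs.be.
    static boolean sameDomain(String from, String to) {
        return naiveDomain(host(from)).equals(naiveDomain(host(to)));
    }

    public static void main(String[] args) {
        String seed = "http://www.wincs.be/";
        String link = "http://wincs.be/lakindustrie.html";
        System.out.println("same host:   " + sameHost(seed, link));
        System.out.println("same domain: " + sameDomain(seed, link));
        // Only the host is compared, so switching http to https changes nothing.
        System.out.println("https seed:  " + sameDomain("https://www.wincs.be/", link));
    }
}
```

With the seed and link from Semyon's example, the host comparison treats the link as external while the domain comparison treats it as internal, which matches Yossi's suggestion to switch the mode to byDomain.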