From user-return-5553-archive-asf-public=cust-asf.ponee.io@manifoldcf.apache.org Thu Nov 15 12:57:28 2018
MIME-Version: 1.0
References: <3491FCC9-BC9B-4D46-BBCA-666A60003E5E@francelabs.com> <820D43F2-2F94-4D2E-A5A7-39E67BCF21FF@francelabs.com>
In-Reply-To: <820D43F2-2F94-4D2E-A5A7-39E67BCF21FF@francelabs.com>
From: Karl Wright
Date: Thu, 15 Nov 2018 06:57:09 -0500
Subject: Re: web connector : links extraction issues
To: user@manifoldcf.apache.org
Reply-To: user@manifoldcf.apache.org
Content-Type: multipart/alternative; boundary="000000000000df3161057ab2c008"

--000000000000df3161057ab2c008
Content-Type: text/plain; charset="UTF-8"

Hi Olivier,

You can create a ticket but I don't have a good solution for you in any
case.

Karl

On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <olivier.tavard@francelabs.com> wrote:

> Hi Karl,
>
> Do you think that I need to create a Jira issue for this bug, i.e.
> that link extraction does not work if the code inside JavaScript tags
> contains special characters like '>' or '<'?
>
> Thanks,
> Best regards,
>
> Olivier
>
> On 30 Oct 2018, at 12:05, Olivier Tavard wrote:
>
> Hi Karl,
>
> Thanks for your answer.
> I kept looking into this and I found what the problem was. The
> JavaScript code inside the <script></script> tags contained the
> character '<'. In that case link extraction does not work with the
> web connector.
>
> To reproduce it, I created this page, hosted on a local Apache server,
> then indexed it with MCF 2.11 out of the box.
>
> In the first example the page was:
>
> <!DOCTYPE html>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript"></script>
> </head>
> <body>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
>
> The link extraction was correct, as the debug log shows:
> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
> ---
> In the second example, the code was much the same except that I
> included the character '<' in the content of the script tags:
>
> <!DOCTYPE html>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript">a<b</script>
> </head>
> <body>
>     <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
>
> Link extraction was not successful, as the debug log indicates:
> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
>  INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
> ---
> So special characters like the less-than sign should be escaped in the
> web connector code to preserve link extraction.
>
> Thanks,
> Best regards,
>
> Olivier
>
> On 29 Oct 2018, at 19:39, Karl Wright wrote:
>
> Hi Olivier,
>
> Javascript inclusion in the Web Connector is not evaluated. In fact, no
> Javascript is executed at all. Therefore it should not matter what is
> included via javascript.
>
> Thanks,
> Karl
>
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
>
>> Hi,
>>
>> Regarding the web connector, I noticed that for specific websites, some
>> JavaScript code can prevent the web connector from correctly fetching
>> all the links present on the page. Specifically, this affects websites
>> that include a deprecated version of the New Relic web agent, such as
>> js-agent.newrelic.com/nr-1071.min.js.
>> After downloading the page locally and removing the reference to the
>> New Relic browser agent, the links were correctly fetched from the page
>> by the web connector. So it seems that the JavaScript injected by the
>> New Relic agent was the reason the links were not fetched.
>> This case is rare and concerns only old versions of the New Relic
>> agent. But more generally, would it be possible to block JavaScript
>> injection at the connector level during indexing?
>>
>> Thanks,
>> Best regards,
>> Olivier

--000000000000df3161057ab2c008
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Hi Olivier,

You can create a ticket but I don't have a good solution for you in any case.

Karl


On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
Hi Karl,

Do you think that I need to create a Jira issue for this bug, i.e. that link extraction does not work if the code inside JavaScript tags contains special characters like '>' or '<'?

Thanks,
Best regards,
Olivier



On 30 Oct 2018, at 12:05, Olivier Tavard <olivier.tavard@francelabs.com> wrote:

Hi Karl,

Thanks for your answer.
I kept looking into this and I found what the problem was. The JavaScript code inside the <script></script> tags contained the character '<'. In that case link extraction does not work with the web connector.

To reproduce it, I created this page, hosted on a local Apache server, then indexed it with MCF 2.11 out of the box.

In the first example the page was:
<!DOCTYPE html>

<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript"></script>

</head>
<body>

<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>

The link extraction was correct, as the debug log shows:
DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
 INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
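
When comparing runs like this, one quick check is to filter the debug output for the "found link to" lines. The following is a minimal sketch, assuming the log line wording shown in the excerpt above. The names `FOUND_LINK`, `found_links` and `sample_log` are hypothetical helpers for illustration and are not part of ManifoldCF.

```python
import re

# Matches the "found link to" debug line as worded in the log excerpt above.
# Group 1 is the crawled document URL, group 2 is the extracted link.
FOUND_LINK = re.compile(r"In html document '([^']+)', found link to '([^']+)'")


def found_links(log_text):
    """Return (document URL, link URL) pairs reported in a debug log."""
    return FOUND_LINK.findall(log_text)


# Three lines copied from the log excerpt above.
sample_log = """\
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
"""

for document, link in found_links(sample_log):
    print(document, "->", link)
```

An empty result for a page that is known to contain anchors suggests that extraction stopped early for that document.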
---
In the second example, the code was much the same except that I included the character '<' in the content of the script tags:
<!DOCTYPE html>

<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">a<b</script>

</head>
<body>
    <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>

</body>
Link extraction was not successful, as the debug log indicates:
DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
 INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
---
So special characters like the less-than sign should be escaped in the web connector code to preserve link extraction.
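
For comparison, here is a minimal sketch of link extraction that treats the contents of script tags as raw text, so a '<' inside the script does not start a new tag. It uses Python's standard `html.parser` module and is not the web connector's code. `LinkCollector`, `extract_links` and `PAGE_WITH_LESS_THAN` are hypothetical names, and the page is the second test page from above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (illustrative, not ManifoldCF code)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports script contents as plain data, so this method
        # is only called for real tags, never for the 'a<b' inside the script.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href of every anchor found in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# The second test page from this message, with '<' inside the script tags.
PAGE_WITH_LESS_THAN = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">a<b</script>
</head>
<body>
    <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""

print(extract_links(PAGE_WITH_LESS_THAN))
```

With this approach the anchor is still found on the second page, which is the behaviour the escaping suggestion above is aiming for.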

Thanks,
Best regards,


Olivier

On 29 Oct 2018, at 19:39, Karl Wright <daddywri@gmail.com> wrote:

Hi Olivier,

Javascript inclusion in the Web Connector is not evaluated.  In fact, no Javascript is executed at all.  Therefore it should not matter what is included via javascript.

Thanks,
Karl


On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
Hi,

Regarding the web connector, I noticed that for specific websites, some JavaScript code can prevent the web connector from correctly fetching all the links present on the page. Specifically, this affects websites that include a deprecated version of the New Relic web agent, such as js-agent.newrelic.com/nr-1071.min.js.
After downloading the page locally and removing the reference to the New Relic browser agent, the links were correctly fetched from the page by the web connector. So it seems that the JavaScript injected by the New Relic agent was the reason the links were not fetched.
This case is rare and concerns only old versions of the New Relic agent. But more generally, would it be possible to block JavaScript injection at the connector level during indexing?
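
As an illustration of what blocking script content before link extraction could look like, here is a minimal sketch of a pre-processing step that drops script elements from a page. This is an assumption for discussion and not an existing ManifoldCF option. `SCRIPT_BLOCK`, `strip_scripts` and the sample `page` are hypothetical, and a regular expression is only an approximation of real HTML parsing.

```python
import re

# Matches a whole <script>...</script> element, including its contents.
# Non-greedy, case-insensitive, and allowed to span several lines.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html_text):
    """Remove script elements so their contents never reach link extraction."""
    return SCRIPT_BLOCK.sub("", html_text)


# Hypothetical page: an inline script containing '<', followed by a link.
page = (
    '<head><script type="text/javascript">var ok = 1 < 2;</script></head>'
    '<body><a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a></body>'
)

print(strip_scripts(page))
```

The anchor survives unchanged while the script body is removed, which mirrors the manual test of deleting the agent reference from the downloaded page.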

Thanks,
Best regards,
Olivier




--000000000000df3161057ab2c008--