manifoldcf-user mailing list archives

From Olivier Tavard <olivier.tav...@francelabs.com>
Subject Re: web connector : links extraction issues
Date Tue, 30 Oct 2018 11:05:13 GMT
Hi Karl,

Thanks for your answer.
I kept looking into this and found the cause. The JavaScript code inside the
<script></script> tags contained the character '<'. When that is the case, link
extraction does not work with the web connector.

To reproduce it, I created the page below, hosted it on a local Apache server, and
indexed it with an out-of-the-box MCF 2.11.

In the first example, the page was:
<!DOCTYPE html>

<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript"></script>

</head>
<body>

<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>

Link extraction worked correctly, as the debug log shows:
DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
 INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
—
In the second example, the code was almost the same, except that I included the
character '<' in the content of the script tags:
<!DOCTYPE html>

<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">a<b</script>

</head>
<body>

    <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
    
</body>

Link extraction failed. The debug log contains no "found link" entry:
DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
 INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
—
So it seems that special characters such as the less-than sign should be escaped in
the web connector code so that link extraction keeps working.
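
As an illustration only, here is a minimal Python sketch of the expected behaviour: a parser that treats the content of <script> as raw text finds the link in both test pages. This is not ManifoldCF code and says nothing about how the web connector parses HTML. It uses Python's standard html.parser module, and the names LinkCollector, extract_links and PAGE_TEMPLATE are hypothetical helpers introduced for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags (hypothetical helper, not MCF code)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser treats <script> content as raw text, so a '<'
        # inside the script body is never reported as a tag here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page):
    """Return the list of <a href> targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(page)
    collector.close()
    return collector.links


# Same structure as the two test pages; only the script body differs.
PAGE_TEMPLATE = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">{script}</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""

page_without_lt = PAGE_TEMPLATE.format(script="")
page_with_lt = PAGE_TEMPLATE.format(script="a<b")

print(extract_links(page_without_lt))
print(extract_links(page_with_lt))
```

With this kind of parsing, both calls should report the single manifoldcf.apache.org link, which is the result one would hope to see in the connector's debug log for the second page as well.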

Thanks,
Best regards,


Olivier 

> On 29 Oct 2018, at 19:39, Karl Wright <daddywri@gmail.com> wrote:
> 
> Hi Olivier,
> 
> Javascript inclusion in the Web Connector is not evaluated.  In fact, no Javascript is
> executed at all.  Therefore it should not matter what is included via javascript.
> 
> Thanks,
> Karl
> 
> 
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
> Hi,
> 
> Regarding the web connector, I noticed that for specific websites, some Javascript code
> can prevent the web connector from correctly fetching all the links present on the page.
> Specifically, this happens for websites that contain a deprecated version of the New Relic
> web agent, such as http://js-agent.newrelic.com/nr-1071.min.js.
> After downloading the page locally and removing the reference to the New Relic browser
> agent, the links in the page were correctly fetched by the web connector. So it seems that
> the Javascript injection caused by the New Relic agent was the reason the links in the
> page were not fetched.
> This case is rare and concerns only old versions of the New Relic agent. But more
> generally, would it be possible to block Javascript injection at the connector level
> during indexing?
>  
> Thanks,
> Best regards,
> Olivier 
> 
> 

