manifoldcf-user mailing list archives

From "Markus Schuch" <markus_sch...@web.de>
Subject Re: webcrawler connector and dns lookups behind corporate http proxy
Date Tue, 11 Oct 2016 22:11:46 GMT
Hi Karl,
 
thanks for the suggestion. I tried it, but the crawled website sends 301 redirects to the
canonical hostname when pages are requested directly via IP address, which again leads to the
IP lookup.
I guess I'm stuck with the /etc/hosts solution then. This will get messy if the IP changes often.
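
To illustrate why the IP address workaround fails, here is a minimal Java sketch. It is not ManifoldCF code; the class name, the helper methods, the address 192.0.2.10 and the hostname www.example.com are all hypothetical. It only shows that the target of a 301 redirect carries the canonical hostname again, so a client that follows it is back to needing a name lookup.

```java
import java.net.URI;

public class RedirectHostSketch {

    // Resolve the Location header of a redirect against the URL that was
    // requested and return the host of the redirect target.
    static String redirectTargetHost(String requestUrl, String locationHeader) {
        URI target = URI.create(requestUrl).resolve(locationHeader);
        return target.getHost();
    }

    // Simplified check (IPv4 literals only, an assumption of this sketch):
    // anything that is not a dotted-quad literal needs a name lookup.
    static boolean needsDnsLookup(String host) {
        return !host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    public static void main(String[] args) {
        // Hypothetical request by IP address, answered with a 301 to the canonical hostname.
        String requested = "http://192.0.2.10/page.html";
        String location = "http://www.example.com/page.html";

        String host = redirectTargetHost(requested, location);
        System.out.println("redirect target host: " + host);
        System.out.println("needs DNS lookup: " + needsDnsLookup(host));
    }
}
```

A relative Location value such as /other.html would keep the IP address as host; the problem only arises because the site redirects to an absolute URL with its canonical name.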

I'd like to understand the mechanics of the crawler better: what is the reason for resolving
the IP addresses instead of using the hostnames?
 
Thanks
Markus
 

Sent: Monday, 10 October 2016 at 22:00
From: "Karl Wright" <daddywri@gmail.com>
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: webcrawler connector and dns lookups behind corporate http proxy

If the proxy is not authenticated, I think you can just put the IP address in as the machine
name and it should work.  But that's all I can think of.
 
Karl
 
 
On Mon, Oct 10, 2016 at 3:44 PM, Markus Schuch <markus_schuch@web.de> wrote:

Hi @ the lovely mcf community out there,
 
in our setup we run ManifoldCF (2.3) behind a corporate HTTP proxy server and we try to crawl
specific web pages on the internet.
 
We run into java.net.UnknownHostException because the connector tries to
resolve the IP of the hostname. This fails because our network setup does not allow direct
DNS lookups for internet pages and the JDK's InetAddress.getByName() call relies on the system's
DNS lookup mechanisms. All internet traffic goes through the corporate HTTP proxy server, which
does all necessary DNS resolution on its side.
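
The difference between local and proxy-side resolution can be sketched in plain JDK terms. This is a hedged illustration only, not the connector's actual code: the class and method names and the hostname www.example.com are made up for the example. It contrasts InetAddress.getByName(), which resolves on the local machine, with the two ways a client can hand a hostname to a proxy unresolved: an unresolved InetSocketAddress and an absolute-URI request line.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class ProxyDnsSketch {

    // Local resolution, as InetAddress.getByName() does it. For a hostname this
    // goes through the system resolver; returns null when the lookup fails.
    static InetAddress tryLocalLookup(String host) {
        try {
            return InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return null;
        }
    }

    // Describes an endpoint by name only. createUnresolved() performs no DNS
    // lookup, so the name can be passed on for someone else to resolve.
    static InetSocketAddress unresolvedEndpoint(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    // Request line sent to a plain HTTP proxy: the absolute URI carries the
    // hostname, and the proxy resolves it on its side.
    static String proxyRequestLine(String host, String path) {
        return "GET http://" + host + path + " HTTP/1.1";
    }

    public static void main(String[] args) {
        String host = "www.example.com"; // hypothetical target site

        InetSocketAddress endpoint = unresolvedEndpoint(host, 80);
        System.out.println("unresolved: " + endpoint.isUnresolved());
        System.out.println(proxyRequestLine(host, "/index.html"));

        // Only attempt a real lookup when a hostname is given on the command line;
        // on a network without direct DNS this is the call that fails.
        if (args.length > 0) {
            InetAddress address = tryLocalLookup(args[0]);
            System.out.println(args[0] + " -> " + (address == null ? "UnknownHostException" : address.getHostAddress()));
        }
    }
}
```

A literal IP address passed to InetAddress.getByName() is only checked for format and never reaches the resolver, which is why entering the IP as the machine name avoids the exception until a redirect brings the hostname back.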
 
Can you think of any other (more elegant) solution besides adding the records to /etc/hosts
on the crawler's machine?
 
Many thanks in advance,
Markus
 
 
