From: Jan van Haarst
Date: Sun, 8 Jul 2012 12:39:37 +0200
Subject: Re: Crawling behind an ISA proxy (iis 7.5)
To: user@manifoldcf.apache.org

Dear All,

We are now able to connect to the IIS proxy. Thanks to the logging facilities Karl added, we were able to see that this is the fix:

Index: connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
===================================================================
--- connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (revision 1357379)
+++ connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (working copy)
@@ -361,7 +361,7 @@
       String emailAddress = params.getParameter(WebcrawlerConfig.PARAMETER_EMAIL);
       if (emailAddress == null)
         throw new ManifoldCFException("Missing email address");
-      userAgent = "ApacheManifoldCFWebCrawler; "+emailAddress+")";
+      userAgent = "Mozilla/5.0 (ApacheManifoldCFWebCrawler; "+emailAddress+")";
       from = emailAddress;
 
       x = params.getParameter(WebcrawlerConfig.PARAMETER_ROBOTSUSAGE);

Yes, this is weird: a proxy shouldn't fail on User-Agent settings, but apparently this one does.
Even Google apparently uses this form of User-Agent: http://www.useragentstring.com/pages/Googlebot/
Now we 'just' have to get the crawling working, but the main (and only) hurdle has been cleared!

Karl, a big Thank You for your help, and for the openssl s_client tip that enabled us to debug this.

Bye,
Jan

On Thu, Jun 28, 2012 at 11:05 PM, Jan van Haarst wrote:
> On Thu, Jun 28, 2012 at 11:26 AM, Karl Wright wrote:
>
>> I was wondering if you'd picked up and tried the patch for
>> CONNECTORS-483. This patch adds official proxy support for the Web
>> Connector. Alternatively, you could try to build and run with trunk
>> code.
>>
>> Karl
>>
>
> I'm going the build-from-trunk route, and all seems to go well up to the
> creation of the zip and tar.gz files.
> Is there anything special to do after running the build process like this?
>
> ant clean clean-core-deps clean-deps && ant make-core-deps make-deps build && ant image
>
> Did I miss anything?
> If not, I'll replace the old binary installation with my source-built one,
> and see where it leads me.
>
> --
> Bye,
> Jan
>

--
Bye,
Jan

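For anyone who wants to see the effect of the patch in isolation, here is a minimal Java sketch of how the User-Agent string is put together after the change. The class and method names (`UserAgentSketch`, `buildUserAgent`) and the example address are hypothetical and not part of ManifoldCF; only the string format is taken from the diff in this message.

```java
// Hypothetical helper, not ManifoldCF code: it only mirrors the string
// built by the patched line in WebcrawlerConnector.java.
class UserAgentSketch {

    // Builds a browser-style User-Agent with the crawler name and the
    // contact email address inside the parenthesised comment.
    static String buildUserAgent(String emailAddress) {
        if (emailAddress == null) {
            // The real connector throws ManifoldCFException here; a standard
            // exception is used so the sketch has no external dependencies.
            throw new IllegalArgumentException("Missing email address");
        }
        return "Mozilla/5.0 (ApacheManifoldCFWebCrawler; " + emailAddress + ")";
    }

    public static void main(String[] args) {
        // Prints: Mozilla/5.0 (ApacheManifoldCFWebCrawler; crawler@example.com)
        System.out.println(buildUserAgent("crawler@example.com"));
    }
}
```

The leading `Mozilla/5.0 (` is what made the difference for our proxy. It also balances the closing parenthesis that the old string already had.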