Date: Tue, 15 May 2012 16:57:55 +0200
Subject: Re: Crawling behind an ISA proxy (iis 7.5)
From: Rene Nederhand
To: connectors-user@incubator.apache.org

Hi Karl,

Thank you so much for your detailed explanation. I am trying each step you've pointed out. Unfortunately, I cannot get this thing going. Hopefully, you can help me if I give you more detailed information.

The sequence of steps (when accessing https://bb.helo.hanze.nl) is:

1. https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3

This does indeed give me NTLM authentication. When I create a crawler that only crawls the above page, I get a 200 response. So this works; no 401.

2. If I submit my username and password, this request is sent to the server. This is also the only form I'll ever see:

https://bb.helo.hanze.nl/CookieAuth.dll?Logon (302)

Request:
curl            Z2F
flags           0
forcedownlevel  0
formdir         3
trusted         0
username        loginname
password        mypassword
SubmitCreds     Log On

3.
The response is a cookie being set, with a redirect to the first URL (but now with the cookie set).

Response:
HTTP/1.1 302 Moved Temporarily
Location: https://bb.helo.hanze.nl/
Set-Cookie: noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"; HttpOnly; Domain=.hanze.nl; secure; path=/
Content-Length: 0
Connection: close

Request:
GET / HTTP/1.1
Host: bb.helo.hanze.nl
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:12.0) Gecko/20100101 Firefox/12.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Referer: https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
Cookie: noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"

4. Lastly, a redirect is made to the Blackboard site (JavaScript check for the cookie, then redirect).

Response:





You are being redirected to another page
Please Wait...
Click here to access the page to which you are being forwarded.
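
To make the sequence above concrete, here is a minimal Python sketch of steps 1 and 2. It only builds the credential request and labels the URLs seen in the trace; it sends nothing over the network. Several things are assumptions rather than facts from the trace: that the form is submitted as a form-encoded POST, and the names `build_login_request`, `classify`, and `LOGIN_SEQUENCE`, which are hypothetical helpers and not part of ManifoldCF. The username and password are the placeholders from the trace.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://bb.helo.hanze.nl"

# Form fields exactly as listed in step 2; the credentials are placeholders.
LOGIN_FIELDS = {
    "curl": "Z2F",
    "flags": "0",
    "forcedownlevel": "0",
    "formdir": "3",
    "trusted": "0",
    "username": "loginname",
    "password": "mypassword",
    "SubmitCreds": "Log On",
}


def build_login_request(fields=LOGIN_FIELDS):
    """Build (but do not send) the request to CookieAuth.dll?Logon.

    Assumption: the login form is submitted as a form-encoded POST.
    """
    body = urlencode(fields).encode("ascii")
    return Request(
        BASE + "/CookieAuth.dll?Logon",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical patterns, derived only from the URLs in the trace, that
# separate login-sequence pages from ordinary protected content.
LOGIN_SEQUENCE = [
    ("login form", re.compile(r"/CookieAuth\.dll\?GetLogon")),
    ("credential post", re.compile(r"/CookieAuth\.dll\?Logon")),
]


def classify(url):
    """Return the login-sequence label for a URL, or 'content' if none match."""
    for label, pattern in LOGIN_SEQUENCE:
        if pattern.search(url):
            return label
    return "content"


if __name__ == "__main__":
    req = build_login_request()
    print(req.get_method(), req.full_url)
    print(classify(BASE + "/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3"))
    print(classify(BASE + "/"))
```

The same split between "URLs that are part of the login sequence" and "everything else" is what a session-based login configuration needs to capture, whatever tool ends up performing the crawl.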
Although the first form used NTLM authentication, this doesn't work out. Therefore, I would think that session-based auth would work better, as I can create each step myself. I still haven't a clue how to approach this. What do I fill in those boxes?

Thanks for helping me.

Cheers,

René

On Fri, May 11, 2012 at 4:26 PM, Karl Wright wrote:
> Hi Rene,
>
> Crawling through a proxy is usually easy, but crawling a session-based
> site is always a challenge.
>
> ISA proxies usually authenticate with NTLM. So you will want to set
> up your web connection with NTLM authentication in order to even be
> able to reach the pages. It's not clear that you've got that right
> yet, because if you don't have it right you will get 401 errors back.
> Getting this right is a prerequisite; you won't be able to proceed
> until it is correct. To see that you do, try a very limited crawl
> that fetches ONLY the login page (or some other un-session-protected
> content). If you get a 401 you'll need to figure out what's not right
> before proceeding.
>
> It sounds like the site may also be secured using session-based
> authentication. If a cookie is involved then you need to configure
> session auth in order to get to any session-protected pages. The
> trick is that, for session-based auth, you need to fully understand
> the sequence of pages and forms that happen when a user visits the
> site and is granted the cookie(s) - the login process, what content
> URLs are protected, what URLs are part of the login sequence, etc.
> The end-user documentation describes this in some detail. It can be a
> challenge to get it all set up right.
>
> Finally, for SharePoint sites, if you are intending to index
> documents, you might well find the SharePoint Connector a better
> choice than trying to crawl the site with the web connector.
>
> Thanks,
> Karl
>
> On Fri, May 11, 2012 at 10:13 AM, Rene Nederhand wrote:
>> Hi,
>>
>> I am trying to get ManifoldCF to crawl our electronic learning
>> environment (Blackboard). To enable single sign-on, our institution
>> has placed an ISA server as a proxy in front of Blackboard.
>> This is giving me a lot of problems.
>>
>> I've managed to get past the ISA server using session-based
>> authentication, but then I am stuck at a 401 error message. According
>> to our architect, ISA is responsible for the communication with
>> Blackboard and will set a cookie so Blackboard will know that a
>> legitimate user is accessing its service. I think ManifoldCF is not
>> able to handle this cookie and hence is not able to access Blackboard.
>> Am I right? If so, is there a possibility to get Blackboard indexed?
>>
>> By the way, the same authentication is used for our SharePoint. I
>> would like to index this as well....
>>
>> Any help on solving this problem is appreciated.
>>
>> Cheers,
>>
>> René