manifoldcf-user mailing list archives

From Rene Nederhand <r...@nederhand.net>
Subject Re: Crawling behind an ISA proxy (iis 7.5)
Date Wed, 16 May 2012 14:23:35 GMT
Hi Karl,

Thank you so much for putting so much time into educating a newbie. I
appreciate your help enormously.

I've tried to follow each of the steps below. So far it doesn't work, but I
will continue this evening to see if I can get this thing going.

In the meantime, I have switched the log level of the crawling process to
"INFO" and found something interesting in the logs. Perhaps this could
shed some light on my issues:

ERROR 2012-05-16 16:04:13,581 (Thread-1019) - Invalid challenge: Basic
org.apache.commons.httpclient.auth.MalformedChallengeException: Invalid challenge: Basic
    at org.apache.commons.httpclient.auth.AuthChallengeParser.extractParams(Unknown Source)
    at org.apache.commons.httpclient.auth.RFC2617Scheme.processChallenge(Unknown Source)
    at org.apache.commons.httpclient.auth.BasicScheme.processChallenge(Unknown Source)
    at org.apache.commons.httpclient.auth.AuthChallengeProcessor.processChallenge(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.processWWWAuthChallenge(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.processAuthenticationResponse(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
    at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)

Please note that I have set NTLM (not BASIC) authentication on
"bb.helo.hanze.nl" and nothing else. The error does not occur when I
crawl our intranet (also with NTLM). Does this mean something? At least, I
think it is the source of the 401 I get when looking at the simple report,
isn't it?
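
One possible reading of the trace, offered as an assumption rather than a confirmed diagnosis: the server sent a "WWW-Authenticate: Basic" header with nothing after the scheme name (no realm), and the parser rejects such a challenge. The Python sketch below imitates that kind of check. `parse_challenge` is a hypothetical helper written for illustration; it is not part of commons-httpclient or ManifoldCF.

```python
def parse_challenge(header_value):
    """Split a WWW-Authenticate value into (scheme, params).

    Assumption: like the parser in the trace above, a challenge with no
    parameter list after the scheme name is treated as malformed.
    """
    value = header_value.strip()
    scheme, sep, rest = value.partition(" ")
    if not sep or not rest.strip():
        raise ValueError("Invalid challenge: " + value)
    params = {}
    # Naive split: does not handle commas inside quoted values.
    for part in rest.split(","):
        key, eq, val = part.strip().partition("=")
        if eq:
            params[key.lower()] = val.strip('"')
    return scheme, params
```

If this reading is right, looking at the raw WWW-Authenticate headers in the proxy capture should show whether a bare "Basic" challenge is really being sent.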

In addition, I've used Charles proxy to monitor all interaction between my
browser and the server. I have found that it doesn't matter which URL I use
to enter Blackboard; they all get redirected to
https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon. Shouldn't page-based
authentication handle this?
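
For reference, the logon URL above contains characters that are special in regular expressions ("." and "?"). A minimal Python sketch of a pattern that matches only that URL, with or without its query-string tail, could look like this. The names `LOGON_URL` and `is_logon_url` are made up for illustration; ManifoldCF itself takes Java regular expressions, where these escapes should behave the same way.

```python
import re

# Hypothetical pattern: "." and "?" are escaped so they match literally,
# and an optional "?..." tail is allowed after "GetLogon".
LOGON_URL = re.compile(
    r"^https://bb\.helo\.hanze\.nl/CookieAuth\.dll\?GetLogon(\?.*)?$"
)

def is_logon_url(url):
    """Return True only for the GetLogon page, not for other pages on the host."""
    return LOGON_URL.match(url) is not None
```

Leaving the "?" unescaped would make the preceding "l" optional instead of matching a literal question mark, so the pattern would silently fail to match the real URL.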

To make the information complete, I've added the HAR file with the
Charles proxy output. It can be displayed at
http://www.softwareishard.com/har/viewer/ for example. You'll be able to
see all requests/responses when I start with a clean browser (cookies
removed) and enter https://bb.helo.hanze.nl. Maybe this helps.
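
A HAR file is plain JSON, so the request/response sequence can also be listed without a viewer. The sketch below assumes the standard HAR layout (log.entries, each with a request and a response); `summarize_har` is a hypothetical helper name, and nothing here is specific to Charles or ManifoldCF.

```python
import json

def summarize_har(har_text):
    """Return (method, url, status, redirect_target) for each HAR entry."""
    har = json.loads(har_text)
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        response = entry["response"]
        rows.append((
            request["method"],
            request["url"],
            response["status"],
            # redirectURL is empty or absent when the response is not a redirect
            response.get("redirectURL", ""),
        ))
    return rows
```

Printing these rows in order gives the chain of redirects that a session-based login sequence has to describe.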

Again, thanks a lot for your help!

René





On Tue, May 15, 2012 at 5:59 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Rene,
>
> You will need both NTLM auth (page auth, which you have already set
> up), and Session auth (which you haven't yet set up).
>
> In order to set up session-based auth, you should first identify the
> set of pages that you want access to that are protected by a cookie
> requirement.  You will need to write a regular expression that matches
> these pages and ONLY these pages.  This expression gets entered as the "URL
> regular expression" on the Access Credentials tab in the Session-based
> Access Credentials part of the tab.  Then, click the Add button.
>
> The next thing you will need is to specify how the connector
> recognizes pages that belong to the logon sequence.  The actual
> sequence you need to understand is what happens in the browser when
> you try to access a specific protected URL and you don't have the
> right cookie.  You did not actually specify that; I think you are
> presuming that you'd be entering directly through the logon page, but
> that is not how it works.  The crawler will have a URL in mind and
> will need access to the content of that URL.  It will fetch the URL,
> and if the actual content is NOT fetched, we need to detect that
> situation and consider it part of the logon sequence.
>
> So let's pretend that what happens when the cookie is not present is
> that you get a redirection to the logon page, instead of the actual
> page content.  In that case, you would create a login sequence page
> description consisting of the same URL regular expression that
> describes the protected content pages, plus the "redirection" radio
> button, plus a target URL regular expression that would match
> "bb.helo.hanze.nl/CookieAuth.dll?GetLogon".  You then click the Add
> button for login pages to add that description to the set of login
> pages.
>
> Next, the GetLogon page itself needs to be added as a login sequence
> page.  The regular expression should match only
> "bb.helo.hanze.nl/CookieAuth.dll?GetLogon".  The type of the page is
> "form" because you said this was a form where you could fill in your
> login credentials.  If there is only one form on the page you can
> leave the regexp that matches the form name blank since that will
> match everything.  Once you click "Add" for this page, you will have
> the opportunity to fill in form names and values to post when the form
> gets posted.
>
> It was not clear from your description, once again, what happens after
> the Logon page is posted.  If there is a special target page, you need
> to include that also in the login sequence so that its content is not
> taken.  If there is a redirection back to the original content page,
> you'd include that redirection.
>
> Hopefully this is beginning to make a bit of sense to you; but this is
> the general picture, not related to your actual site that closely.
> For example, the Javascript redirection you mentioned will not be
> processed by ManifoldCF, but that is unnecessary because at the end of
> the whole login sequence ManifoldCF automatically goes back to the
> original URL when the login sequence is chased to its end.  So all you
> need to do is make sure that all pages that are part of that sequence
> are specified.
>
> On the other hand, it's not clear that the code you have "protecting"
> the site sets cookies any other way than through Javascript.  The
> cookie that this Javascript actually sets is a really stupid
> non-specific cookie, but unless it is set by the standard response
> header method, I don't think it's going to wind up being set at all.
> Can you confirm that this is the only way the cookie gets set?
>
> Karl
>
> On Tue, May 15, 2012 at 10:57 AM, Rene Nederhand <rene@nederhand.net>
> wrote:
> > Hi Karl,
> >
> > Thank you so much for your detailed explanation. I am trying each
> > step you've pointed out. Unfortunately, I cannot get this thing going.
> > Hopefully, you can help me if I give you more detailed information.
> >
> > The sequence of steps is (when accessing https://bb.helo.hanze.nl):
> >
> > 1. https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
> > This gives me indeed NTLM authentication. When I create a crawler that
> > only crawls the above page I get a 200 response. So this works, no
> > 401.
> >
> > 2. When I submit my username and password, this request is sent to the
> > server. This is also the only form I'll ever see:
> >
> > https://bb.helo.hanze.nl/CookieAuth.dll?Logon (302)
> > Request:
> > curl    Z2F
> > flags   0
> > forcedownlevel  0
> > formdir 3
> > trusted 0
> > username        loginname
> > password        mypassword
> > SubmitCreds     Log On
> >
> > 3. The response is a cookie being set with a redirect to the first url
> > (but now with the cookie set)
> >
> > Response:
> >        HTTP/1.1 302 Moved Temporarily
> > Location        https://bb.helo.hanze.nl/
> > Set-Cookie      noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9";
> > HttpOnly; Domain=.hanze.nl; secure; path=/
> > Content-Length  0
> > Connection      close
> >
> > Request:
> >        GET / HTTP/1.1
> > Host    bb.helo.hanze.nl
> > User-Agent      Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:12.0)
> > Gecko/20100101 Firefox/12.0
> > Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > Accept-Language en-us,en;q=0.5
> > Accept-Encoding gzip, deflate
> > Connection      keep-alive
> > Referer https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
> > Cookie  noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"
> >
> > 4. Lastly, a redirect is made to the Blackboard site (javascript check
> > for cookie and redirect)
> >
> > Response:
> > <HTML dir='ltr'><HEAD>
> > <META HTTP-EQUIV="Pragma" CONTENT="no-cache"><META
> > HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
> > <script language="Javascript">
> >  cookie_name = "cookies_enabled";
> >  document.cookie=cookie_name+"=yes";
> >  if (!document.cookie) {
> >    document.location.href="/nocookies.html";
> >  }
> >  document.cookie=cookie_name+"yes;expires=Thu, 01-Jan-1970 00:00:01 GMT";
> > </script>
> > <SCRIPT language="Javascript"><!--
> > document.location.replace('https://bb.helo.hanze.nl/webapps/portal/frameset.jsp');
> > //--></SCRIPT></HEAD>
> > <BODY BGCOLOR='#FFFFFF' LINK='#000000' ALINK='#000000'>
> > <br><br><br><br><div style="text-align: center;"><hr width='350' height='5'><br>
> > <strong>You are being redirected to another page</strong>
> > <p><strong>Please Wait...</strong><br><br><hr width='350' height='5'>
> > <br><A HREF='https://bb.helo.hanze.nl/webapps/portal/frameset.jsp'><strong>Click
> > here to access the page to which you are being
> > forwarded.</strong></A></div>
> > </BODY></HTML>
> >
> > Although the first form used NTLM authentication, this doesn't work
> > out. Therefore, I would think that session based auth would work
> > better as I can create each step myself. I still haven't a clue how to
> > approach this. What do I fill in those boxes?
> >
> > Thanks for helping me.
> >
> > Cheers,
> > René
> >
> >
> >
> >
> > On Fri, May 11, 2012 at 4:26 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> Hi Rene,
> >>
> >> Crawling through a proxy is usually easy, but crawling a session-based
> >> site is always a challenge.
> >>
> >> ISA proxies usually authenticate with NTLM.  So you will want to set
> >> up your web connection with NTLM authentication in order to even be
> >> able to reach the pages.  It's not clear that you've got that right
> >> yet, because if you don't have it right you will get 401 errors back.
> >> Getting this right is a prerequisite; you won't be able to proceed
> >> until it is correct.  To see that you do, try a very limited crawl
> >> that fetches ONLY the login page (or some other un-session-protected
> >> content).  If you get a 401 you'll need to figure out what's not right
> >> before proceeding.
> >>
> >> It sounds like the site may also be secured using session-based
> >> authentication.  If a cookie is involved then you need to configure
> >> session auth in order to get to any session-protected pages.  The
> >> trick is that, for session-based auth, you need to fully understand
> >> the sequence of pages and forms that happen when a user visits the
> >> site and is granted the cookie(s) - the login process, what content
> >> URLs are protected, what URLs are part of the login sequence, etc.
> >> The end-user documentation describes this in some detail.  It can be a
> >> challenge to get it all set up right.
> >>
> >> Finally, for SharePoint sites, if you are intending to index
> >> documents, you might well find the SharePoint Connector a better
> >> choice than trying to crawl the site with the web connector.
> >>
> >> Thanks,
> >> Karl
> >>
> >> On Fri, May 11, 2012 at 10:13 AM, Rene Nederhand <rene@nederhand.net> wrote:
> >>> Hi,
> >>>
> >>> I am trying to get ManifoldCF to crawl our electronic learning
> >>> environment (Blackboard). To enable single sign-on, our institution
> >>> has placed an ISA server as proxy before Blackboard.
> >>> This is giving me a lot of problems.
> >>>
> >>> I've managed to get past the ISA server using session-based
> >>> authentication, but then I am stuck at a 401 error message. According
> >>> to our architect, ISA is responsible for the communication with
> >>> Blackboard and will set a cookie so Blackboard will know that a
> >>> legitimate user is accessing its service. I think ManifoldCF is not
> >>> able to handle this cookie and hence is not able to access Blackboard.
> >>> Am I right? If so, is there a possibility to get Blackboard indexed?
> >>>
> >>> By the way, the same authentication is used for our SharePoint. I
> >>> would like to index this as well.
> >>>
> >>> Any help on solving this problem is appreciated.
> >>>
> >>> Cheers,
> >>>
> >>> René
>
