manifoldcf-user mailing list archives

From Michael Kooloos <mkool...@hotmail.com>
Subject RE: Web connector - Session-based access credentials
Date Mon, 03 Sep 2012 13:04:21 GMT

Sorry, I meant "second last page" ;)

> Date: Mon, 3 Sep 2012 08:18:02 -0400
> Subject: Re: Web connector - Session-based access credentials
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
> 
> What do you mean, "first last page"?
> The Web Connector needs to refetch the page that caused the
> redirection, because that is likely to be a content page based on the
> user's own description of the login sequence.  Otherwise pages would
> be missing from the crawl, whenever login needed to be redone.
> 
> Karl
> 
> On Mon, Sep 3, 2012 at 7:43 AM, Michael Kooloos <mkooloos@hotmail.com> wrote:
> > No, the same thing happens in the browser too, so I need to find a different
> > seed page that doesn't have this behaviour, but no luck there yet..
> >
> > Another way to solve this 'issue' would be for the connector to go back to the
> > first last page after finishing the login sequence, instead of the last page
> > (since the last page stays in a loop). That should be possible, right?
> >
> > Michael
> >
> >> Date: Mon, 3 Sep 2012 07:15:20 -0400
> >
> >> Subject: Re: Web connector - Session-based access credentials
> >> From: daddywri@gmail.com
> >> To: user@manifoldcf.apache.org
> >>
> >> Ok - if the redirect is occurring in a browser whether or not you are
> >> logged in, then yes, you cannot use that page as a seed. If this only
> >> seems to happen in the Web Connector, on the other hand, we should
> >> keep talking, because your login sequence is not actually succeeding
> >> in setting up the session cookies properly.
> >>
> >> Thanks!
> >> Karl
> >>
> >> On Mon, Sep 3, 2012 at 6:14 AM, Michael Kooloos <mkooloos@hotmail.com> wrote:
> >> > Hi Karl,
> >> >
> >> > Thanks. Found the issue: the seed document keeps redirecting to the logon
> >> > page (even after login has occurred). This is an issue (protection?) of the
> >> > website, and it now makes sense to me why the connector stays in a loop.
> >> > Haven't found a solution yet; I have to find a more appropriate seed document
> >> > or a way to skip the redirect the second time it enters the loop..
> >> >
> >> > Many thanks for your support!
> >> >
> >> >> Date: Thu, 30 Aug 2012 11:52:01 -0400
> >> >
> >> >> Subject: Re: Web connector - Session-based access credentials
> >> >> From: daddywri@gmail.com
> >> >> To: user@manifoldcf.apache.org
> >> >>
> >> >> If I understand how you have it set up, what the ManifoldCF web
> >> >> connector will do is this:
> >> >>
> >> >> (1) Fetch the seed document.
> >> >> (2) Take the redirection to the logon page, and thus enter the login sequence
> >> >> (3) Do the login sequence and establish the correct cookies
> >> >> (4) Refetch the seed document
> >> >> (5) Take the redirection to the logon page...
> >> >>
> >> >> So, as you can see, your seed document must redirect ONLY if login has
> >> >> not yet occurred, or you will be stuck in a loop. So either fix that,
> >> >> or choose a more appropriate seed document.
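
A minimal sketch of the five steps above, written in Python purely for
illustration. It is not ManifoldCF code: ToySite, crawl_seed, the /seed and
/login paths, and the max_logins cap are all hypothetical names made up for
this example. It only models the described behaviour, where the crawler
refetches the page that triggered the login sequence.

```python
from dataclasses import dataclass, field


@dataclass
class ToySite:
    """Hypothetical site whose /seed page redirects to /login."""
    always_redirects: bool            # True models the misbehaving seed page
    cookies: set = field(default_factory=set)

    def fetch(self, url):
        """Return (status, redirect_location) for a request."""
        if url == "/seed":
            if self.always_redirects or "session" not in self.cookies:
                return 302, "/login"
        return 200, None

    def login(self):
        """Stand-in for the whole login sequence: sets the session cookie."""
        self.cookies.add("session")


def crawl_seed(site, max_logins=3):
    """Return how many login sequences ran before the seed was fetched,
    or None if the seed was still redirecting after max_logins attempts."""
    logins = 0
    while True:
        status, location = site.fetch("/seed")    # steps (1) and (4)
        if not (status == 302 and location == "/login"):
            return logins                         # real content came back
        if logins == max_logins:
            return None                           # step (5) forever: a loop
        site.login()                              # steps (2) and (3)
        logins += 1
```

With ToySite(always_redirects=False) the seed is fetched after one login
sequence. With ToySite(always_redirects=True) the function gives up and
returns None, which corresponds to the endless loop described above.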
> >> >>
> >> >> On a normal site, you typically get different results on most content
> >> >> pages when login has occurred vs. when login has not yet occurred. It
> >> >> is up to you to define in the Web Connector what combination of pages
> >> >> and content constitutes a logon request vs. a normal content fetch. And
> >> >> that's the whole problem, and why this is so complicated.
> >> >>
> >> >> Thanks,
> >> >> Karl
> >> >>
> >> >> On Thu, Aug 30, 2012 at 11:38 AM, Michael Kooloos <mkooloos@hotmail.com> wrote:
> >> >> > Karl,
> >> >> >
> >> >> > My seed document is not a logon page, but the seed document URL
> >> >> > automatically redirects to the logon pages. So the first regex is that of the
> >> >> > logon page, then the regex for the Login URL is the same (since it's the
> >> >> > logon page), type = Form. Do I define any redirect after the logon form?
> >> >> >
> >> >> > Hope this makes a bit of sense..
> >> >> >
> >> >> > Didn't think it would be that hard to set up some access credentials..
> >> >> >
> >> >> >> Date: Thu, 30 Aug 2012 10:03:20 -0400
> >> >> >
> >> >> >> Subject: Re: Web connector - Session-based access credentials
> >> >> >> From: daddywri@gmail.com
> >> >> >> To: user@manifoldcf.apache.org
> >> >> >>
> >> >> >> It sounds like your regular expression(s) which describe what pages
> >> >> >> belong to the logon sequence may be incorrect. After the logon
> >> >> >> sequence exits, the crawler will attempt to refetch the page it was
> >> >> >> working on before it entered the logon sequence. If that page is PART
> >> >> >> of the logon sequence, it will loop as you describe.
> >> >> >>
> >> >> >> Your seed documents should therefore NOT be logon pages, or you will
> >> >> >> never get anywhere...
> >> >> >>
> >> >> >> Karl
> >> >> >>
> >> >> >> On Thu, Aug 30, 2012 at 9:58 AM, Michael Kooloos <mkooloos@hotmail.com> wrote:
> >> >> >> > Karl,
> >> >> >> >
> >> >> >> > I've read through the similar problems/questions on the list (only
> >> >> >> > found 3), but without any luck. In the Seed I have the page I want to
> >> >> >> > crawl, but this one is protected by security, so I set up a redirect to
> >> >> >> > the login page and a form for the login page with the username/password
> >> >> >> > parameters. When I look in the Simple History I see the fetch of the
> >> >> >> > first page, the begin-logon, the redirect to the login page, and the
> >> >> >> > end-logon, but then it starts all over again and stays in a loop. Any
> >> >> >> > ideas? I think a working example would help me a lot..
> >> >> >> >
> >> >> >> > Michael
> >> >> >> >
> >> >> >> >> Date: Thu, 30 Aug 2012 09:29:08 -0400
> >> >> >> >> Subject: Re: Web connector - Session-based access credentials
> >> >> >> >> From: daddywri@gmail.com
> >> >> >> >> To: user@manifoldcf.apache.org
> >> >> >> >
> >> >> >> >>
> >> >> >> >> I set it up to crawl Angie's List at one point. It was developed to
> >> >> >> >> crawl an oil-and-gas exploration subscription site. Others have
> >> >> >> >> posted fairly detailed questions and/or problems to this list, so I
> >> >> >> >> know it has been used by many.
> >> >> >> >>
> >> >> >> >> Can you give a more thorough and detailed description of what you are
> >> >> >> >> trying to crawl, and what is happening for you?
> >> >> >> >>
> >> >> >> >> Karl
> >> >> >> >>
> >> >> >> >> On Thu, Aug 30, 2012 at 9:25 AM, Michael Kooloos <mkooloos@hotmail.com> wrote:
> >> >> >> >> >
> >> >> >> >> > Hi,
> >> >> >> >> >
> >> >> >> >> > Does anyone have a working example of the session-based access
> >> >> >> >> > credentials for the web connector? I'm following the end-user
> >> >> >> >> > documentation as closely as possible, but still no luck :(
> >> >> >> >> >
> >> >> >> >> > Thanks!
 		 	   		  