incubator-connectors-dev mailing list archives

From "Karl Wright (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector
Date Sun, 16 Oct 2011 01:40:12 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128319#comment-13128319
] 

Karl Wright commented on CONNECTORS-275:
----------------------------------------

It would help to hear why the pages a user would obviously need in order to log into this
site with a browser do not exist.  The Web Connector is designed
to permit crawling only of sites that can be visited by a human being with a browser; it's
not a generic HTTP API crawler by any stretch.

To answer specific questions about the connector itself:

bq. But it's not clear why there's TWO Regex's per entry. There's a "Login URL" regex, and
also a "Form name/link target" regex.

The "Login URL" regexp lets you specify, by URL, which pages are part of the login page
sequence.  The "Form name/link target" regexp and the "page type" you mentioned together
determine which fetches ManifoldCF regards as part of the login sequence and which it does
not.  As it says in the end-user documentation: "You declare a page to be a login page by
identifying it both by its URL, and by what the crawler finds on the page when it fetches it."
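
To make the two-part test concrete, here is a minimal sketch in Python. It is not ManifoldCF
code: the names LoginPageRule, Fetch and is_login_page are hypothetical, and the unanchored
"search" style of regexp matching is an assumption. It only illustrates the idea that a fetch
counts as a login page when its URL matches AND the expected form, link, or redirection is found.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoginPageRule:
    """Hypothetical model of one login page entry."""
    url_regexp: str      # the "Login URL" regexp
    page_type: str       # "form", "link", or "redirection"
    target_regexp: str   # the "Form name/link target" regexp


@dataclass
class Fetch:
    """Hypothetical summary of what the crawler saw for one fetch."""
    url: str
    form_names: List[str] = field(default_factory=list)
    link_targets: List[str] = field(default_factory=list)
    redirect_target: Optional[str] = None


def is_login_page(rule: LoginPageRule, fetch: Fetch) -> bool:
    # Part 1: the fetched URL must match the login URL regexp.
    if re.search(rule.url_regexp, fetch.url) is None:
        return False
    # Part 2: what was found on the page must match, according to page type.
    if rule.page_type == "form":
        candidates = fetch.form_names
    elif rule.page_type == "link":
        candidates = fetch.link_targets
    elif rule.page_type == "redirection":
        candidates = [] if fetch.redirect_target is None else [fetch.redirect_target]
    else:
        raise ValueError("unknown page type: " + rule.page_type)
    return any(re.search(rule.target_regexp, c) is not None for c in candidates)


# Assumed example values, loosely based on the reporter's scenario.
rule = LoginPageRule(r"^http://site\.com/product", "redirection", r"/Main\.asp$")
redirected = Fetch("http://site.com/product?id=1234",
                   redirect_target="http://site.com/Main.asp")
content = Fetch("http://site.com/product?id=1234")

print(is_login_page(rule, redirected))  # True: part of the login sequence
print(is_login_page(rule, content))     # False: ordinary content
```

In this sketch the same URL is treated as login traffic only when the redirection is actually
observed; a normal fetch of that URL remains a content page.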

bq. For "redirection", am I saying "look for a redirect event", or am I saying "then DO a redirect
to this page".

You are saying that a fetch whose URL matches the login URL regexp, and which results in a
redirection, will be considered part of the login sequence, and is thus not indexable content.

bq. And for "form name", what if my login page doesn't have a named form? In the case of the
site I'm trying to spider, when your session expires, you manually go back to an https page
and supply your username and password as CGI parameters. I know this sounds odd, but it's
apparently how a number of the sites we're trying to spider work, some proprietary software.

You can match a missing or blank form name with an empty regexp, or even more specifically
"^$", which ONLY matches the empty string.

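The difference between the two patterns can be checked with a short Python sketch. The helper
name form_name_matches is hypothetical, and it assumes the regexp is applied as an unanchored
search against the form name (a missing name being treated as the empty string); the connector's
actual matching code is not shown in this thread.

```python
import re


def form_name_matches(pattern: str, form_name: str) -> bool:
    # Assumption: unanchored "search" semantics, as with a regexp "find".
    return re.search(pattern, form_name) is not None


print(form_name_matches("", ""))             # True
print(form_name_matches("", "loginForm"))    # True: empty regexp matches anything
print(form_name_matches("^$", ""))           # True
print(form_name_matches("^$", "loginForm"))  # False: only the empty name matches
```

So the empty regexp is the permissive choice, while "^$" restricts the match to forms that
have no name at all.
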
Hope this helps.

                
> Clarify documentation as to how to set up session login for web connector
> -------------------------------------------------------------------------
>
>                 Key: CONNECTORS-275
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-275
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Documentation, Web connector
>    Affects Versions: ManifoldCF 0.4
>            Reporter: Karl Wright
>
> A book reader has this comment, which basically implies that we need to improve the documentation
> for the web connector:
> "I was excited to get the full version of the online book, but then disappointed when
> it referred back to the online doc for setting up logins for a Web spidering. The online doc
> is very vague and only gives one example. I've used Ultraseek's and Google's spider, but I
> still find the Session login sequences non-obvious.
> I've got a subscription request into the user mailing list, but here's the parts that
> are not clear.
> I generally understand about using regexes to define sites and sorting out content pages
> from login pages.
> But it's not clear why there's TWO Regex's per entry. There's a "Login URL" regex, and
> also a "Form name/link target" regex.
> It's also not clear about the "page type" radio button choices.
> For "redirection", am I saying "look for a redirect event", or am I saying "then DO a
> redirect to this page".
> And for "form name", what if my login page doesn't have a named form? In the case of
> the site I'm trying to spider, when your session expires, you manually go back to an https
> page and supply your username and password as CGI parameters. I know this sounds odd, but
> it's apparently how a number of the sites we're trying to spider work, some proprietary software.
> Karl, I really think the book or Wiki or doc needs 3 or 4 different examples of login
> scenarios.
> Here's the scenario I'm trying, if you'd like to use it:
> Try to fetch: http://site.com/product?id=1234
> If you get a redirect to: http://site.com/Main.asp
> Note that there's no login form nor link on this page.
> Then invoke this login URL: https://site.com/validate?username=me&password=that&otherArg=something
> Note that you can't just visit this page and fill in a form, that gives an error, it
> has to be passed in (I think as a GET)
> Then record the session cookie and try for /product?id=1234 again.
> I realize this is odd, I didn't design it. "

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
