manifoldcf-user mailing list archives

From TC Tobin-Campbell ...@epic.com>
Subject RE: ManifoldCF and Kerberos/Basic Authentication
Date Fri, 07 Jun 2013 15:52:43 GMT
I took a look at the output connection, and didn't see anything in there that looked like it
would cause any issues. I'm including all of the default MIME and file extensions. This should
just be html I would think.

Here's what I'm seeing in the DEBUG output. It seems like we are starting the extraction,
but then just aren't doing anything with it?? Seems weird.

DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Waiting to start getting a connection to http://10.8.159.161:80
DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Attempting to get connection to http://10.8.159.161:80 (0 ms)
DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Successfully got connection to http://10.8.159.161:80 (0 ms)
DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Waiting to start getting a connection to http://10.8.159.161:80
DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Attempting to get connection to http://10.8.159.161:80 (0 ms)
DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Successfully got connection to http://10.8.159.161:80 (0 ms)
DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Waiting for an HttpClient object
DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: For http://wiki/main/EpicSearch/Test, discovered matching authentication credentials
DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: For http://wiki/main/EpicSearch/Test, setting virtual host to wiki
DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Got an HttpClient object after 0 ms.
DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Get method for '/main/EpicSearch/Test'
DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Waiting for an HttpClient object
DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: For http://wiki.epic.com/main/EpicSearch/Test, discovered matching authentication credentials
DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: For http://wiki.epic.com/main/EpicSearch/Test, setting virtual host to wiki.epic.com
DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Got an HttpClient object after 0 ms.
DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Get method for '/main/EpicSearch/Test'
WARN 2013-06-07 10:40:27,900 (Thread-2185) - NEGOTIATE authentication error: Invalid name provided (Mechanism level: Could not load configuration file C:\Windows\krb5.ini (The system cannot find the file specified))
WARN 2013-06-07 10:40:27,900 (Thread-2188) - NEGOTIATE authentication error: Invalid name provided (Mechanism level: Could not load configuration file C:\Windows\krb5.ini (The system cannot find the file specified))
DEBUG 2013-06-07 10:40:28,378 (Thread-2185) - WEB: Performing a read wait on bin 'wiki' of 128 ms.
DEBUG 2013-06-07 10:40:28,506 (Thread-2185) - WEB: Performing a read wait on bin 'wiki' of 50 ms.
DEBUG 2013-06-07 10:40:28,556 (Thread-2185) - WEB: Performing a read wait on bin 'wiki' of 64 ms.
DEBUG 2013-06-07 10:40:28,613 (Thread-2188) - WEB: Performing a read wait on bin 'wiki.epic.com' of 126 ms.
DEBUG 2013-06-07 10:40:28,620 (Thread-2185) - WEB: Performing a read wait on bin 'wiki' of 47 ms.
INFO 2013-06-07 10:40:28,682 (Worker thread '20') - WEB: FETCH URL|http://wiki/main/EpicSearch/Test|1370619627893+787|200|14438|
DEBUG 2013-06-07 10:40:28,682 (Worker thread '20') - WEB: Document 'http://wiki/main/EpicSearch/Test' is text, with encoding 'utf-8'; link extraction starting

Followed by lots of these, which seems appropriate:
DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: Url 'http://wiki/mediawiki/main/index.php?action=edit&title=EpicSearch/Test' is illegal because no include patterns match it
DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: In html document 'http://wiki/main/EpicSearch/Test', found an unincluded URL '/mediawiki/main/index.php?title=EpicSearch/Test&action=edit'
DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: Url 'http://wiki/mediawiki/main/index.php?action=edit&title=EpicSearch/Test' is illegal because no include patterns match it
DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: In html document 'http://wiki/main/EpicSearch/Test', found an unincluded URL '/mediawiki/main/index.php?title=EpicSearch/Test&action=edit'
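
When scanning a long debug log like the one above, it can help to pull out just the FETCH lines and see which URLs returned 200 and how large they were. The following is a minimal Python sketch, not part of ManifoldCF. The field layout (URL, start time in epoch milliseconds plus elapsed milliseconds, HTTP status, byte count) is an assumption inferred from the single sample line in this message, and the names FETCH_RE and parse_fetch are hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed layout, inferred only from the sample line in this thread:
#   FETCH URL|<url>|<start epoch ms>+<elapsed ms>|<HTTP status>|<bytes>|
FETCH_RE = re.compile(
    r"FETCH URL\|(?P<url>[^|]+)\|(?P<start>\d+)\+(?P<elapsed>\d+)"
    r"\|(?P<status>-?\d+)\|(?P<size>\d+)\|"
)


def parse_fetch(line):
    """Return a dict describing a FETCH line, or None if the line does not match."""
    match = FETCH_RE.search(line)
    if match is None:
        return None
    start_ms = int(match.group("start"))
    return {
        "url": match.group("url"),
        # Interpreting the first number as epoch milliseconds is an assumption.
        "started_utc": datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc),
        "elapsed_ms": int(match.group("elapsed")),
        "status": int(match.group("status")),
        "bytes": int(match.group("size")),
    }


sample = (
    "INFO 2013-06-07 10:40:28,682 (Worker thread '20') - WEB: FETCH URL|"
    "http://wiki/main/EpicSearch/Test|1370619627893+787|200|14438|"
)
record = parse_fetch(sample)
print(record)
```

If the assumed layout is right, the sample line describes a successful 200 response of 14438 bytes, which is consistent with the fetch itself working and the problem lying further downstream.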

TC Tobin-Campbell | Technical Services | Willow | Epic  | (608) 271-9000

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, June 07, 2013 9:49 AM
To: user@manifoldcf.apache.org
Subject: Re: ManifoldCF and Kerberos/Basic Authentication

Hi TC,
The fact that the fetch is successful means that the URL is included (and not excluded). 
The fact that it doesn't mention a robots exclusion means that robots.txt is happy with it.
 But it could well be that:
(a) the mimetype is one that your ElasticSearch connection is excluding;
(b) the extension is one that your ElasticSearch connection is excluding.
I would check your output connection, and if that doesn't help turn on connector debugging
(in properties.xml, set property "org.apache.manifoldcf.connectors" to "DEBUG").  Then you
will see output that describes the consideration process the web connector is going through
for each document.
Karl
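
The chain of checks described in the message above can be sketched roughly as follows. This is an illustration in Python, not ManifoldCF code: the function name crawl_decision, its parameters, and the rule that a URL without a file extension passes the extension filter are all assumptions made for the example. The robots.txt check is left out.

```python
import posixpath
import re
from urllib.parse import urlparse


def crawl_decision(url, mime_type, include_patterns, exclude_patterns,
                   allowed_mime_types, allowed_extensions):
    """Return (indexed, reason) for one document. Every name here is hypothetical."""
    # Stage 1: the crawler only fetches URLs that are included and not excluded.
    if not any(re.search(p, url) for p in include_patterns):
        return False, "no include pattern matches"
    if any(re.search(p, url) for p in exclude_patterns):
        return False, "an exclude pattern matches"

    # Stage 2: the output connection may still reject the fetched document.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    if base_type not in allowed_mime_types:
        return False, "mime type excluded by output connection"

    extension = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
    # Assumption: a URL with no extension is not rejected by the extension filter.
    if extension and extension not in allowed_extensions:
        return False, "extension excluded by output connection"

    return True, "sent to index"


includes = [r"^http://wiki/main/"]
page = crawl_decision("http://wiki/main/EpicSearch/Test", "text/html; charset=utf-8",
                      includes, [], {"text/html"}, {"html", "htm"})
edit_link = crawl_decision(
    "http://wiki/mediawiki/main/index.php?action=edit&title=EpicSearch/Test",
    "text/html", includes, [], {"text/html"}, {"html", "htm"})
print(page)
print(edit_link)
```

The point of the sketch is only that a successful fetch clears the first stage; a document can still be dropped at the second stage without any error appearing in the fetch log.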

On Fri, Jun 7, 2013 at 10:43 AM, TC Tobin-Campbell <TC@epic.com> wrote:
Apologies for the delay here Karl. I was able to get this up and running, and the authentication
is working. Thanks for getting that in so quickly!

I do have a new issue though. I have an output connection to Elasticsearch set up for this
job.

I can see that the crawler is in fact crawling the wiki, and the fetches are all working great.
However, it doesn't seem to be attempting to send the pages to the index.


I'm not seeing anything in the elasticsearch logs, so it appears we're just not sending anything
to Elasticsearch. Could this be related to the change you made? Or is this a completely separate
problem?

TC Tobin-Campbell | Technical Services | Willow | Epic  | (608) 271-9000

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, May 24, 2013 12:50 PM
To: user@manifoldcf.apache.org
Subject: Re: ManifoldCF and Kerberos/Basic Authentication

I had a second so I finished this.  Trunk now has support for basic auth.  You enter the credentials
on the server tab underneath the API credentials.  Please give it a try and let me know if
it works for you.

Karl

On Fri, May 24, 2013 at 11:28 AM, Karl Wright <daddywri@gmail.com> wrote:
CONNECTORS-692.  I will probably look at this over the weekend.
Karl

On Fri, May 24, 2013 at 11:26 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi TC,
Unless I'm very much mistaken, there are no Apache kerberos session cookies being used on
your site, so it should be a straightforward matter to include basic auth credentials to your
Apache mod-auth-kerb module for all pages during crawling.
I'll create a ticket for this.

Karl

On Fri, May 24, 2013 at 11:14 AM, TC Tobin-Campbell <TC@epic.com> wrote:
Hi Karl,
Here's what I know so far.

Our module is configured to use two auth methods: Negotiate and Basic.  In most cases, we
use Negotiate, but I'm guessing you'd prefer Basic.

Here's an example header.

GET / HTTP/1.1
Host: wiki.epic.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: wooTracker=QOMVLXDIC6OGOUXMGST1O54HYW573NNC; .EPICASPXAUTHQA=FA94C945F613DACB9341384EBB1C28C52CFC52558E606FC2F880DD5BA811BE7E94301C7A0A1990FAC2E119AABB8591EC975059A2B8169BEA9FC525D0577F3C0EC56DC29C28880D23E0790AD890024FB57A338981606774259656B6971556645B095778115ADFE6B9B434970869C4B546A59A61B2CDEF0C0A5B23E80BB1D1E3D3D567E4C113D9E7B32D137FDEE65E51AC7B3DF5A04F9767FA7C8723140AC274E2695D939C716D9B49CCF0F1D79967CE902781BC8CB5A253E3FB39896021ABB4F2FCA01D0E138E00A8176EB2ECE5B0204597C21969C8F501A9EDE4D27694E699777BB179CD329748B3341A4BBF3085C447E2B55BE97E27D23E415C23F1A53A33A15551D9AE6B5CF255C3B8ECE038A481B8291A8EC46F0EA8730C3658DABC5BE7557C6659321677D8F4586CA79D6D5CCCB1C5687F9077A6CD96487EAEF417A1411C2F62BE6FF57DD1F515B16406CF4B0B9460EFB9BCB46F8F7E47FCB8E8CE4FAE2EB92F20DECEF2BBF1D95C80597BE935A031CD158593EFA2E446FA6FAFDD2B4E691CD8569B7D60DAD4378EBD6A138E23F0F616FD01443647D9A6F852AEF773A69580390496748241739C0DDF2791B1C2143B7E9E976754056B70EB846DAE1D7018CC40026F862ABF613D89C8D31B2C468B81D0C18C37697E8BA5D415F8DFCA37AF2935AAD0238ED6F652E24062849EC8E0C4651C4FB8BB9DD11BE4F8639AD690C791868B8E94ADB626C9B1BD8E334F675E664A03DC;
wiki_pensieve_session=j1pcf1746js1442m7p92hag9g1; wiki_pensieveUserID=5; wiki_pensieveUserName=Lziobro;
wiki_pensieveToken=********************be3a3a990a8a
Connection: keep-alive
Authorization: Basic bHppb**************xMjM0   <-I've censored this line so you cannot
get my password
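
For context on why the Authorization line above needed censoring: a Basic credential is just the string "username:password" encoded with Base64, so anyone who sees the header can recover the password. A minimal Python sketch follows. The function names are hypothetical, and the sample credentials are the well-known illustrative pair from the HTTP Basic authentication RFCs, not anything from this thread.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Authorization header for Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header_value):
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth(header))
```

Because the encoding is trivially reversible, sending this header on every request is only safe over an encrypted connection.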

If I'm understanding you correctly, there's no way to accomplish this currently? Or, is there
some workaround we could implement?

TC Tobin-Campbell | Technical Services | Willow | Epic  | (608) 271-9000

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, May 16, 2013 12:05 PM
To: user@manifoldcf.apache.org
Subject: Re: ManifoldCF and Kerberos/Basic Authentication

Hi TC,

Apparently mod-auth-kerb can be configured in a number of different ways.  But if yours will
work with basic auth, we can just transmit the credentials each time.  It will be relatively
slow because mod-auth-kerb will then need to talk to the kdc on each page fetch, but it should
work.  Better yet would be if Apache set a browser cookie containing your tickets, which it
knew how to interpret if returned - but I don't see any Google evidence that mod-auth-kerb
is capable of that.  But either of these two approaches we could readily implement.
FWIW, the standard way to work with kerberos is for you to actually have tickets already kinit'd
and installed on your machine.  Your browser then picks up those tickets and transmits them
to the Wiki server (I presume in a header that mod-auth-kerb knows about), and the kdc does
not need to be involved.  But initializing that kind of ticket store, and managing the associated
kinit requests when necessary, are beyond the scope of any connector we've so far done, so
if we had to go that way, that would effectively make this proposal a Research Project.
What would be great to know in advance is how exactly your browser interacts with your Apache
server.  Are you familiar with the process of getting a packet dump?  You'd use a tool like
tcpdump (Unix) or Wireshark (Windows) in order to capture the packet traffic between a browser
session and your Apache server, to see exactly what is happening.  Start by shutting down
all your browser windows, so there is no in-memory state, and then start the capture and browse
to a part of the wiki that is secured by mod-auth-kerb.  We'd want to see if cookies get set,
or if any special headers get transmitted by your browser (other than the standard Basic Auth
"Authorization" headers).  If the exchange is protected by SSL, then you'll have to use Firefox
and use a plugin called LiveHeaders to see what is going on instead.
Please let me know what you find.
Karl
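
Once the response headers from such a capture are in hand, the two questions raised above (which auth schemes the server offers, and whether it sets any cookies) can be answered with a few lines of Python. This is a rough sketch under stated assumptions: summarize_auth is a hypothetical helper, it expects (name, value) pairs such as those returned by http.client.HTTPResponse.getheaders(), and it assumes each WWW-Authenticate header carries a single challenge. The sample headers are invented for illustration and are not taken from the server in this thread.

```python
def summarize_auth(headers):
    """Summarize auth challenges and cookie names from (name, value) header pairs."""
    schemes = []
    cookies = []
    for name, value in headers:
        lowered = name.lower()
        if lowered == "www-authenticate":
            # Simplification: treat the first token as the scheme and assume
            # one challenge per header line.
            parts = value.split(None, 1)
            if parts:
                schemes.append(parts[0])
        elif lowered == "set-cookie":
            # The cookie name is everything before the first "=".
            cookies.append(value.split("=", 1)[0].strip())
    return {"schemes": schemes, "cookies": cookies}


# Invented sample of what a 401 response might look like.
sample_headers = [
    ("WWW-Authenticate", "Negotiate"),
    ("WWW-Authenticate", 'Basic realm="example"'),
    ("Set-Cookie", "session=abc123; Path=/"),
    ("Content-Type", "text/html"),
]
summary = summarize_auth(sample_headers)
print(summary)
```

An empty cookie list alongside a Basic challenge would suggest that credentials have to be resent on every request, which is the first approach discussed above.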


On Thu, May 16, 2013 at 12:37 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi TC,
Thanks, this is a big help in understanding your setup.
I don't know enough about exactly *how* mod-auth-kerb uses Basic Auth to communicate with
the browser, and whether it expects the browser to cache the resulting tickets (in cookies?)
 I will have to do some research and get back to you on that.
Basically, security for a Wiki is usually handled by the Wiki, but since you've put added
auth in front of it by going through mod-auth-kerb, it's something that the Wiki connector
would have to understand (and emulate your browser) in order to implement.  So it does not
likely support this right now.  It may be relatively easy to do or it may be a challenge -
we'll see.  I would also be somewhat concerned that it may not be possible to actually reach
the API urls through Apache; that would make everything moot if it were true.  Could you confirm
that you can visit API urls through your Apache setup?
Karl

On Thu, May 16, 2013 at 12:21 PM, TC Tobin-Campbell <TC@epic.com> wrote:
Hi there,
I'm trying to connect ManifoldCF to an internal wiki at my company. The ManifoldCF wiki connector
supplies username and password fields for the wiki API; however, at my company, a username
and password are required to connect to the Apache server running the wiki site, and after
that authentication takes place, those credentials are passed on to the wiki API.

So, essentially, I need a way to have ManifoldCF pass my Windows credentials on when trying
to make its connection. Using the API login fields does not work.

We use the Kerberos Module for Apache (http://modauthkerb.sourceforge.net/index.html)
(AuthType Kerberos).  My understanding, based on that linked documentation, is that this module
does use Basic Auth to communicate with the browser.

Is there anything we can do to make ManifoldCF authenticate in this scenario?

Thanks,


TC Tobin-Campbell | Technical Services | Willow | Epic  | (608) 271-9000
