manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF and Kerberos/Basic Authentication
Date Thu, 13 Jun 2013 15:39:20 GMT
I just checked the ES connector; it properly tracks specification changes,
so you should not need to click the
"reingest all documents" button.  Instead, I suspect it is the mime type
that is causing the problem.  I'll try to find a way to get better feedback
about reasons for document exclusion into the web connector.
CONNECTORS-715 captures this work item.

Karl



On Thu, Jun 13, 2013 at 11:28 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi TC,
>
> I don't know if the elastic search connector properly tracks changes to
> configuration; it's possible that your specification change is not properly
> causing a reindex of the excluded documents.
>
> Can you click on the reingest all documents button in your es connection
> and restart the job and see if that makes a difference?
>
> Also, "." does not apply to mime types at this time.  It sounds to me like
> you are really looking for a wildcard for mimetypes, since you don't know
> what the right mimetypes are.   Can you determine the mime type of your
> wiki pages so we don't have to guess what it is?
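>
> One quick way to check is a HEAD request against the page and a look at
> the Content-Type header - just a sketch using plain java.net, with the
> URL taken from your logs; since your site requires auth you may also
> need to set an Authorization header:
>
> >>>>>>
> import java.net.HttpURLConnection;
> import java.net.URL;
>
> public class MimeCheck
> {
>   public static void main(String[] args) throws Exception
>   {
>     // Issue a HEAD request and print the Content-Type header,
>     // which is the mime type the crawler will see.
>     HttpURLConnection conn = (HttpURLConnection)
>       new URL("http://wiki/main/EpicSearch/Test").openConnection();
>     conn.setRequestMethod("HEAD");
>     System.out.println(conn.getHeaderField("Content-Type"));
>     conn.disconnect();
>   }
> }
> <<<<<<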
>
> Karl
>
> Sent from my Windows Phone
> ------------------------------
> From: TC Tobin-Campbell
> Sent: 6/13/2013 10:58 AM
> To: user@manifoldcf.apache.org
> Subject: RE: ManifoldCF and Kerberos/Basic Authentication
>
>  I added in "." for both allowed file extensions and allowed MIME types.
> Still no luck. Is there something else I should be looking at in the job or
> something?
>
>
>
>
>
> DEBUG 2013-06-13 09:55:21,794 (Worker thread '44') - WEB: Decided not to
> ingest 'http://wiki/main/EpicSearch/Test' because it did not match
> ingestability criteria
>
>
>
>
>
> *TC Tobin-Campbell *| Technical Services | Willow | *Epic*  | (608)
> 271-9000
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Thursday, June 13, 2013 8:47 AM
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ManifoldCF and Kerberos/Basic Authentication
>
>
>
> Hi TC,
>
> Please read the comments in the ticket.  You will need to change your
> ElasticSearch extension list in order for it to accept documents with no
> extension.  To do that you need to add a new extension of "." to your
> extension list.
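>
> For example, if your allowed list currently contained "html" and "htm"
> (those particular entries are just an assumption), it would become:
>
> >>>>>>
> html
> htm
> .
> <<<<<<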
>
> Karl
>
>
>
> On Thu, Jun 13, 2013 at 9:43 AM, TC Tobin-Campbell <TC@epic.com> wrote:
>
> Hey Karl,
>
> I updated my working copy, rebuilt using ant, and tried again. I’m still
> not getting anything to post to Elasticsearch.
>
>
>
>
>
> I did notice this line in the logfile.
>
>
>
> DEBUG 2013-06-13 08:25:36,976 (Worker thread '3') - WEB: Decided not to
> ingest 'http://wiki/main/EpicSearch/Test' because it did not match
> ingestability criteria
>
>
>
> I was poking around in my setup, and still am not seeing anything
> configured incorrectly. Any other thoughts?
>
>
>
> *TC Tobin-Campbell *| Technical Services | Willow | *Epic*  | (608)
> 271-9000
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, June 07, 2013 12:29 PM
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ManifoldCF and Kerberos/Basic Authentication
>
>
>
> Fix checked into trunk.
> Karl
>
>
>
> On Fri, Jun 7, 2013 at 12:42 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> I created the ticket: CONNECTORS-707.
>
>
>
> On Fri, Jun 7, 2013 at 12:16 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> I looked at the ElasticSearch connector, and it's going to treat these
> extensions as being "" (empty string).  So your list of allowed extensions
> will have to include "" if such documents are to be ingested.
>
> Checking now to see if in fact you can just add a blank line to the list
> of extensions to get this to happen... it looks like you can't:
>
> >>>>>>
>       while ((line = br.readLine()) != null)
>       {
>         line = line.trim();
>         if (line.length() > 0)
>           set.add(line);
>       }
> <<<<<<
>
> So, the ElasticSearch connector in its infinite wisdom excludes all
> documents that have no extension.  Hmm.
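>
> One possible shape for a fix - a sketch only, along the lines of the "."
> convention mentioned further up this thread - would be to let a lone "."
> stand for the empty extension when the list is parsed:
>
> >>>>>>
>       while ((line = br.readLine()) != null)
>       {
>         line = line.trim();
>         // Hypothetical: treat a lone "." as the empty extension,
>         // so that extensionless documents can be admitted.
>         if (line.equals("."))
>           set.add("");
>         else if (line.length() > 0)
>           set.add(line);
>       }
> <<<<<<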
>
> Can you open a ticket for this problem?  I'm not quite sure yet how to
> address it, but clearly this needs to be fixed.
>
> Karl
>
>
>
> On Fri, Jun 7, 2013 at 12:07 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> The extension of a document comes from the url.  So for the urls listed in
> your previous mail, they don't appear to have any extension at all.
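>
> Presumably the logic is something like this (a hypothetical sketch, not
> the connector's actual code):
>
> >>>>>>
>     // The extension is whatever follows the last '.' in the final
>     // path segment; a URL like http://wiki/main/EpicSearch/Test
>     // therefore yields the empty string.
>     String path = java.net.URI.create(documentIdentifier).getPath();
>     int slash = path.lastIndexOf('/');
>     int dot = path.lastIndexOf('.');
>     String extension = (dot > slash) ? path.substring(dot + 1) : "";
> <<<<<<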
>
> The code here from the web connector rejects documents because of various
> reasons, but does not log it:
>
> >>>>>>
>     // The fetch must have returned a 200.
>     if (cache.getResponseCode(documentIdentifier) != 200)
>       return false;
>
>     // The document length must be acceptable to the output connector.
>     if (activities.checkLengthIndexable(
>         cache.getDataLength(documentIdentifier)) == false)
>       return false;
>
>     // The URL must be acceptable to the output connector.
>     if (activities.checkURLIndexable(documentIdentifier) == false)
>       return false;
>
>     // The document must pass the job's URL inclusion/exclusion patterns.
>     if (filter.isDocumentIndexable(documentIdentifier) == false)
>       return false;
> <<<<<<
>
> All you would see if any one of these conditions failed would be:
>
>           if (Logging.connectors.isDebugEnabled())
>             Logging.connectors.debug("WEB: Decided not to ingest
> '"+documentIdentifier+"' because it did not match ingestability criteria");
>
> Do you see that in the log?
>
> Also, bear in mind that since the crawler is incremental, you may need to
> kick it to make it retry all this so you get debugging output.  You can
> click the "reingest all" link on your output connection to make that
> happen...
>
> Karl
>
>
>
> On Fri, Jun 7, 2013 at 11:52 AM, TC Tobin-Campbell <TC@epic.com> wrote:
>
> I took a look at the output connection, and didn’t see anything in there
> that looked like it would cause any issues. I’m including all of the
> default MIME and file extensions. This should just be html I would think.
>
>
>
> Here’s what I’m seeing in the DEBUG output. It seems like we are starting
> the extraction, but then just aren’t doing anything with it?? Seems weird.
>
>
>
> DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Waiting to start
> getting a connection to http://10.8.159.161:80
>
> DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Attempting to
> get connection to http://10.8.159.161:80 (0 ms)
>
> DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Successfully got
> connection to http://10.8.159.161:80 (0 ms)
>
> DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Waiting to start
> getting a connection to http://10.8.159.161:80
>
> DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Attempting to
> get connection to http://10.8.159.161:80 (0 ms)
>
> DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Successfully got
> connection to http://10.8.159.161:80 (0 ms)
>
> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Waiting for an
> HttpClient object
>
> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: For
> http://wiki/main/EpicSearch/Test, discovered matching authentication
> credentials
>
> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: For
> http://wiki/main/EpicSearch/Test, setting virtual host to wiki
>
> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Got an
> HttpClient object after 0 ms.
>
> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Get method for
> '/main/EpicSearch/Test'
>
> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Waiting for an
> HttpClient object
>
> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: For
> http://wiki.epic.com/main/EpicSearch/Test, discovered matching
> authentication credentials
>
> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: For
> http://wiki.epic.com/main/EpicSearch/Test, setting virtual host to
> wiki.epic.com
>
> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Got an
> HttpClient object after 0 ms.
>
> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Get method for
> '/main/EpicSearch/Test'
>
> WARN 2013-06-07 10:40:27,900 (Thread-2185) - NEGOTIATE authentication
> error: Invalid name provided (Mechanism level: Could not load configuration
> file C:\Windows\krb5.ini (The system cannot find the file specified))
>
> WARN 2013-06-07 10:40:27,900 (Thread-2188) - NEGOTIATE authentication
> error: Invalid name provided (Mechanism level: Could not load configuration
> file C:\Windows\krb5.ini (The system cannot find the file specified))
>
> DEBUG 2013-06-07 10:40:28,378 (Thread-2185) - WEB: Performing a read wait
> on bin 'wiki' of 128 ms.
>
> DEBUG 2013-06-07 10:40:28,506 (Thread-2185) - WEB: Performing a read wait
> on bin 'wiki' of 50 ms.
>
> DEBUG 2013-06-07 10:40:28,556 (Thread-2185) - WEB: Performing a read wait
> on bin 'wiki' of 64 ms.
>
> DEBUG 2013-06-07 10:40:28,613 (Thread-2188) - WEB: Performing a read wait
> on bin 'wiki.epic.com' of 126 ms.
>
> DEBUG 2013-06-07 10:40:28,620 (Thread-2185) - WEB: Performing a read wait
> on bin 'wiki' of 47 ms.
>
> INFO 2013-06-07 10:40:28,682 (Worker thread '20') - WEB: FETCH URL|
> http://wiki/main/EpicSearch/Test|1370619627893+787|200|14438|
>
> DEBUG 2013-06-07 10:40:28,682 (Worker thread '20') - WEB: Document '
> http://wiki/main/EpicSearch/Test' is text, with encoding 'utf-8'; link
> extraction starting
>
>
>
> *Followed by lots of these, which seems appropriate:*
>
> DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: Url '
> http://wiki/mediawiki/main/index.php?action=edit&title=EpicSearch/Test'
> is illegal because no include patterns match it
>
> DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: In html document
> 'http://wiki/main/EpicSearch/Test', found an unincluded URL
> '/mediawiki/main/index.php?title=EpicSearch/Test&action=edit'
>
> DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: Url '
> http://wiki/mediawiki/main/index.php?action=edit&title=EpicSearch/Test'
> is illegal because no include patterns match it
>
> DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: In html document
> 'http://wiki/main/EpicSearch/Test', found an unincluded URL
> '/mediawiki/main/index.php?title=EpicSearch/Test&action=edit'
>
>
>
> *TC Tobin-Campbell *| Technical Services | Willow | *Epic*  | (608)
> 271-9000
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, June 07, 2013 9:49 AM
>
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ManifoldCF and Kerberos/Basic Authentication
>
>
>
> Hi TC,
>
> The fact that the fetch is successful means that the URL is included (and
> not excluded).  The fact that it doesn't mention a robots exclusion means
> that robots.txt is happy with it.  But it could well be that:
>
> (a) the mimetype is one that your ElasticSearch connection is excluding;
>
> (b) the extension is one that your ElasticSearch connection is excluding.
>
> I would check your output connection, and if that doesn't help turn on
> connector debugging (in properties.xml, set property
> "org.apache.manifoldcf.connectors" to "DEBUG").  Then you will see output
> that describes the consideration process the web connector is going through
> for each document.
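>
> For reference, that property in properties.xml would look something like
> this (a sketch; merge it into your existing file):
>
> >>>>>>
> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
> <<<<<<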
>
> Karl
>
>
>
> On Fri, Jun 7, 2013 at 10:43 AM, TC Tobin-Campbell <TC@epic.com> wrote:
>
> Apologies for the delay here Karl. I was able to get this up and running,
> and the authentication is working. Thanks for getting that in so quickly!
>
>
>
> I do have a new issue though. I have an output connection to Elasticsearch
> setup for this job.
>
>
>
> I can see that the crawler is in fact crawling the wiki, and the fetches
> are all working great. However, it doesn’t seem to be attempting to send
> the pages to the index.
>
>
>
>
>
> I’m not seeing anything in the elasticsearch logs, so it appears we’re
> just not sending anything to Elasticsearch. Could this be related to the
> change you made? Or is this a completely separate problem?
>
>
>
> *TC Tobin-Campbell *| Technical Services | Willow | *Epic*  | (608)
> 271-9000
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, May 24, 2013 12:50 PM
>
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ManifoldCF and Kerberos/Basic Authentication
>
>
>
> I had a second so I finished this.  Trunk now has support for basic auth.
> You enter the credentials on the server tab underneath the API
> credentials.  Please give it a try and let me know if it works for you.
>
> Karl
>
>
>
> On Fri, May 24, 2013 at 11:28 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> CONNECTORS-692.  I will probably look at this over the weekend.
>
> Karl
>
>
>
> On Fri, May 24, 2013 at 11:26 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi TC,
>
> Unless I'm very much mistaken, there are no Apache kerberos session
> cookies being used on your site, so it should be a straightforward matter
> to include basic auth credentials to your Apache mod-auth-kerb module for
> all pages during crawling.
>
> I'll create a ticket for this.
>
> Karl
>
>
>
> On Fri, May 24, 2013 at 11:14 AM, TC Tobin-Campbell <TC@epic.com> wrote:
>
> Hi Karl,
>
> Here’s what I know so far.
>
>
>
> Our module is configured to use two auth methods: Negotiate and Basic.  In
> most cases, we use Negotiate, but I’m guessing you’d prefer Basic.
>
>
>
> Here’s an example header.
>
>
>
> GET / HTTP/1.1
>
> Host: wiki.epic.com
>
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101
> Firefox/20.0
>
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
> Accept-Language: en-US,en;q=0.5
>
> Accept-Encoding: gzip, deflate
>
> Cookie: wooTracker=QOMVLXDIC6OGOUXMGST1O54HYW573NNC;
> .EPICASPXAUTHQA=FA94C945F613DACB9341384EBB1C28C52CFC52558E606FC2F880DD5BA811BE7E94301C7A0A1990FAC2E119AABB8591EC975059A2B8169BEA9FC525D0577F3C0EC56DC29C28880D23E0790AD890024FB57A338981606774259656B6971556645B095778115ADFE6B9B434970869C4B546A59A61B2CDEF0C0A5B23E80BB1D1E3D3D567E4C113D9E7B32D137FDEE65E51AC7B3DF5A04F9767FA7C8723140AC274E2695D939C716D9B49CCF0F1D79967CE902781BC8CB5A253E3FB39896021ABB4F2FCA01D0E138E00A8176EB2ECE5B0204597C21969C8F501A9EDE4D27694E699777BB179CD329748B3341A4BBF3085C447E2B55BE97E27D23E415C23F1A53A33A15551D9AE6B5CF255C3B8ECE038A481B8291A8EC46F0EA8730C3658DABC5BE7557C6659321677D8F4586CA79D6D5CCCB1C5687F9077A6CD96487EAEF417A1411C2F62BE6FF57DD1F515B16406CF4B0B9460EFB9BCB46F8F7E47FCB8E8CE4FAE2EB92F20DECEF2BBF1D95C80597BE935A031CD158593EFA2E446FA6FAFDD2B4E691CD8569B7D60DAD4378EBD6A138E23F0F616FD01443647D9A6F852AEF773A69580390496748241739C0DDF2791B1C2143B7E9E976754056B70EB846DAE1D7018CC40026F862ABF613D89C8D31B2C468B81D0C18C37697E8BA5D415F8DFCA37AF2935AAD0238ED6F652E24062849EC8E0C4651C4FB8BB9DD11BE4F8639AD690C791868B8E94ADB626C9B1BD8E334F675E664A03DC;
> wiki_pensieve_session=j1pcf1746js1442m7p92hag9g1; wiki_pensieveUserID=5;
> wiki_pensieveUserName=Lziobro;
> wiki_pensieveToken=********************be3a3a990a8a
>
> Connection: keep-alive
>
> Authorization: Basic bHppb**************xMjM0   <-I've censored this line
> so you cannot get my password
>
>
>
> If I’m understanding you correctly, there’s no way to accomplish this
> currently? Or, is there some workaround we could implement?
>
>
>
> *TC Tobin-Campbell *| Technical Services | Willow | *Epic*  | (608)
> 271-9000
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Thursday, May 16, 2013 12:05 PM
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: ManifoldCF and Kerberos/Basic Authentication
>
>
>
> Hi TC,
>
> Apparently mod-auth-kerb can be configured in a number of different ways.
> But if yours will work with basic auth, we can just transmit the
> credentials each time.  It will be relatively slow because mod-auth-kerb
> will then need to talk to the kdc on each page fetch, but it should work.
> Better yet would be if Apache set a browser cookie containing your tickets,
> which it knew how to interpret if returned - but I don't see any Google
> evidence that mod-auth-kerb is capable of that.  But either of these two
> approaches we could readily implement.
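>
> For the record, Basic credentials are just "user:password",
> base64-encoded and sent on every request - a sketch in Java, where conn
> is a hypothetical open HttpURLConnection:
>
> >>>>>>
>     String creds = java.util.Base64.getEncoder()
>       .encodeToString("user:password".getBytes(
>         java.nio.charset.StandardCharsets.UTF_8));
>     conn.setRequestProperty("Authorization", "Basic " + creds);
> <<<<<<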
>
> FWIW, the standard way to work with kerberos is for you to actually have
> tickets already kinit'd and installed on your machine.  Your browser then
> picks up those tickets and transmits them to the Wiki server (I presume in
> a header that mod-auth-kerb knows about), and the kdc does not need to be
> involved.  But initializing that kind of ticket store, and managing the
> associated kinit requests when necessary, are beyond the scope of any
> connector we've so far done, so if we had to go that way, that would
> effectively make this proposal a Research Project.
>
> What would be great to know in advance is how exactly your browser
> interacts with your Apache server.  Are you familiar with the process of
> getting a packet dump?  You'd use a tool like tcpdump (Unix) or wireshark
> (windows) in order to capture the packet traffic between a browser session
> and your Apache server, to see exactly what is happening.  Start by
> shutting down all your browser windows, so there is no in-memory state, and
> then start the capture and browse to a part of the wiki that is secured by
> mod-auth-kerb.  We'd want to see if cookies get set, or if any special
> headers get transmitted by your browser (other than the standard Basic Auth
> "Authorization" header).  If the exchange is protected by SSL, then
> you'll have to use Firefox with a plugin called LiveHeaders to see what
> is going on instead.
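>
> For example, a capture invocation might look like this (a sketch; the
> interface name is an assumption, and the filter assumes plain HTTP on
> port 80):
>
> >>>>>>
> tcpdump -i eth0 -s 0 -w wiki.pcap host wiki.epic.com and tcp port 80
> <<<<<<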
>
> Please let me know what you find.
>
> Karl
>
>
>
>
>
> On Thu, May 16, 2013 at 12:37 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi TC,
>
> Thanks, this is a big help in understanding your setup.
>
> I don't know enough about exactly *how* mod-auth-kerb uses Basic Auth to
> communicate with the browser, and whether it expects the browser to cache
> the resulting tickets (in cookies?)  I will have to do some research and
> get back to you on that.
>
> Basically, security for a Wiki is usually handled by the Wiki, but since
> you've added auth in front of it by going through mod-auth-kerb, it's
> something that the Wiki connector would have to understand (and emulate
> your browser) in order to implement.  So it likely does not support this
> right now.  It may be relatively easy to do or it may be a challenge -
> we'll see.  I would also be somewhat concerned that it may not be
> possible to actually reach the API urls through Apache; that would make
> everything moot
> if it were true.  Could you confirm that you can visit API urls through
> your Apache setup?
>
> Karl
>
>
>
> On Thu, May 16, 2013 at 12:21 PM, TC Tobin-Campbell <TC@epic.com> wrote:
>
> Hi there,
>
> I'm trying to connect ManifoldCF to an internal wiki at my company. The
> ManifoldCF wiki connector supplies a username and password field for the
> wiki api, however, at my company, a username and password is required to
> connect to the apache server running the wiki site, and after that
> authentication takes place, those credentials are passed on to the wiki api.
>
>
>
> So, essentially, I need a way to have ManifoldCF pass my windows
> credentials on when trying to make its connection. Using the api login
> fields does not work.
>
>
>
> We use the Kerberos Module for Apache
> <http://modauthkerb.sourceforge.net/index.html> (AuthType Kerberos).  My
> understanding based on that linked documentation is that this module does
> use Basic Auth to communicate with the browser.
>
>
>
> Is there anything we can do to make ManifoldCF authenticate in this
> scenario?
>
>
>
> Thanks,
>
>
>
>
>
> *TC Tobin-Campbell *| Technical Services | Willow | *Epic*  | (608)
> 271-9000
>
>
>
> Sherlock <https://sherlock.epic.com/> (Issue tracking)
>
> Analyst Toolkits
> <https://sites.epic.com/epiclib/epicdoc/Pages/analyst/default.aspx>
> (Common setup and support tasks)
>
> Report Repository
> <https://documentation.epic.com/DataHandbook/Reports/ReportSearch>
> (Epic reports documentation)
>
> Nova <https://nova.epic.com/Login/GetOrg.aspx?returnUrl=%2fdefault.aspx>
> (Release note management)
>
> Galaxy <https://documentation.epic.com/OnlineDoc/Documents.aspx> (Epic
> documentation)
