From: Karl Wright
To: "user@manifoldcf.apache.org"
Date: Mon, 22 Feb 2016 07:32:28 -0500
Subject: Re: HTTP 302 error causing job to abort

Any news on this research?
Karl


On Fri, Feb 19, 2016 at 12:46 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Phil,

Thanks -- this information is more helpful.

So my understanding is that there is an external site reference in your site/subsite hierarchy? And the *root* site (the one that you point at when you configure the connection itself) is *not* external after all?

If that is the case, then the external site must be being "discovered" through the Webs service API call. There are two ways forward:

(1) We can change the Webs response parsing to detect external sites and not include those in the crawl, or
(2) We can try to make decisions based on whether a 302 comes back as a response code.

(1) is by far the best approach, but it will require some cooperation and execution of sample code on your part. Essentially, I'll need to see the xml that comes back and first describes the external site, and see if there is an attribute that lets us know it is external. That way I can properly just skip it entirely.

We can have a look at what comes back from SharePoint for this API response if you enable connector debugging in properties.xml:

<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

... and restart. You will then need to do a crawl. The following line will be what you look for:

Logging.connectors.debug("SharePoint: getSites xml response: "+xmlResponse);

This xml response will contain "Url" and "Title" nodes; what I need to know is whether there's any attribute of the "Url" node, or a parallel node other than "Url" or "Title", that contains an indication of whether the Url that describes the external site is indeed external. So you look for the Url that describes the SharePoint URL that has the redirection, and tell me if there's anything special about it in the associated getSites response. Does that make sense?
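
One way to do that inspection is sketched below in Java. It is only an illustration under stated assumptions: it assumes each site in the getSites response is an element carrying `Url` and `Title` attributes, and the sample XML (the `Hypothetical` attribute and the example.com URLs) is invented rather than real SharePoint output. The class name `GetSitesInspector` is also hypothetical. For every element with a `Url` attribute, it lists any attribute other than `Url` or `Title`, so that anything marking a site as external stands out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of ManifoldCF.
final class GetSitesInspector {

    /** Returns one line per element that has a Url attribute, noting any attribute beyond Url/Title. */
    static List<String> describeSites(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> lines = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*"); // every element, in document order
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (!element.hasAttribute("Url")) {
                continue; // assumption: sites are identified by a Url attribute
            }
            StringBuilder line = new StringBuilder(element.getAttribute("Url"));
            NamedNodeMap attributes = element.getAttributes();
            for (int j = 0; j < attributes.getLength(); j++) {
                Node attribute = attributes.item(j);
                String name = attribute.getNodeName();
                if (!name.equals("Url") && !name.equals("Title")) {
                    line.append(" | extra: ").append(name).append('=').append(attribute.getNodeValue());
                }
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample; replace with the xml taken from the connector debug log.
        String sample = "<Webs>"
            + "<Web Title=\"Team\" Url=\"http://sp.example.com/team\"/>"
            + "<Web Title=\"Partner\" Url=\"http://sp.example.com/partner\" Hypothetical=\"x\"/>"
            + "</Webs>";
        for (String line : describeSites(sample)) {
            System.out.println(line);
        }
    }
}
```

To use it, paste the logged xml response in place of the sample. If the real response uses child nodes rather than attributes, the inner loop would need to walk child elements instead.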

If this is too hard, alternative (2) is possible, but it will require tons of individual changes. So let's look into (1) first.

Thanks
Karl


On Thu, Feb 18, 2016 at 11:49 PM, Phil Riethmuller <priethmuller@funnelback.com> wrote:
Hi Karl,

Some further info:
  • The problem document that Manifold reported is redirecting to an external site.
  • We tried crawling a smaller subset of content on the same Sharepoint site that definitely doesn’t contain any external links in the content, and this works OK.
  • The job that errors with the 302 says it has found 529 docs so far and processed 127 of them. This seems to indicate that it has in fact found some documents.

I’m not sure what you mean when you say that the error is being generated from the API call, and not an individual document. The info appears to indicate it is not all documents, but just selected documents.

There really isn’t much we can do about this from the Sharepoint configuration side. Is there any way we can test whether it is as simple as the 302 coming from the documents themselves?

Thanks for your help to date.

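Phil


From: Karl Wright
Date: Thursday, 18 February 2016 10:31 am

To: "user@manifoldcf.apache.org"
Subject: Re: HTTP 302 error causing job to abort
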
Hi Phil,

The 302 error is not coming from a single document. If it *was* coming from the fetch of an individual document, it would be easy to work around. But, from your stack trace, it is clear that this error is coming from an API call, specifically a call to enumerate subsites of a given site. That means that some or all of the SharePoint hierarchy is not accessible through POST requests. I have never seen this kind of behavior from SharePoint before.

This is not something that I can work around without more information. In order to get that information, you will at the very minimum need to turn on connector debugging, and probably turning on http wire debugging would be helpful too. And, if what you said about the View page for this connection is true and it also shows a 302 error, I very much suspect that something changed on the server end and you are currently unable to crawl *any* documents at all.

I am sorry I cannot make this any clearer.

Thanks,
Karl




On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller <priethmuller@funnelback.com> wrote:
Hi Karl,

Thanks for the update.

I’m not 100% sure how many documents have this redirect in them, but I’ll see if I can get a better estimate. The content we are crawling is substantial and comes from many different authors, so it’s difficult to manage how these Sharepoint documents are created. That makes it extremely difficult to pinpoint all the documents that contain redirects.

Am I correct in assuming a single 302 error causes the job to fail, or is there some other logic that determines this?

How plausible would it be to include in the product an option for treating 302s as a warning, rather than a fatal error? Possibly just an option in the Job setup?

Regards,
Phil


From: Karl Wright <daddywri@gmail.com>
Reply-To: <user@manifoldcf.apache.org>
Date: Thursday, 18 February 2016 1:39 am

To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: HTTP 302 error causing job to abort

Hi again Phil,

The HttpClient team points out that POST requests (as we do for the SharePoint repository requests) are not allowed to follow 302 redirections according to RFC2616. We use POST requests because, for SOAP, there is often quite a bit of XML data that goes along with the request, and we would otherwise have size issues. So we cannot use GET instead of POST. See CONNECTORS-1279 for details.
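
The rule being cited can be sketched as a small predicate in Java. This is a paraphrase of RFC 2616 sections 10.3.2, 10.3.3 and 10.3.8 (301, 302 and 307 may be followed automatically only for GET or HEAD), not HttpClient's actual code; the class and method names are hypothetical, and 303 is deliberately left out.

```java
import java.util.Locale;

// Hypothetical illustration of the RFC 2616 redirect rule; not HttpClient code.
final class RedirectRule {

    /** True when a client may follow the redirect without user confirmation. */
    static boolean mayAutoFollow(String method, int status) {
        String upper = method.toUpperCase(Locale.ROOT);
        boolean getOrHead = upper.equals("GET") || upper.equals("HEAD");
        switch (status) {
            case 301:
            case 302:
            case 307:
                // Only GET and HEAD may be redirected automatically.
                return getOrHead;
            default:
                // Not one of the redirect codes covered by this sketch.
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("GET 302 auto-follow:  " + mayAutoFollow("GET", 302));
        System.out.println("POST 302 auto-follow: " + mayAutoFollow("POST", 302));
    }
}
```

Under this rule a SOAP call sent as a POST surfaces the 302 to the caller instead of being silently redirected, which matches the error reported in this thread.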

If you still believe that it is only a couple of URLs that are returning 302 for you, I'd like some analysis of why you believe that to be true. I would be happy to consider recognition of an occasional 302 response as meaning "skip this document". On the other hand, based on your stack trace, it really appears that you have a far more systemic problem; it is failing while obtaining information for an entire site, so not much would get crawled in that case.
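
A rough sketch of what "skip this document" handling might look like in isolation. Nothing here is connector code: `SkipOnRedirect`, `selectCrawlable`, and the URL-to-status map are hypothetical stand-ins, and the example.com URLs are invented. It assumes a 302 should produce a warning and a skip, while any other non-200 status still aborts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical policy sketch; the real connector is structured differently.
final class SkipOnRedirect {

    /** Keeps URLs that answered 200, skips those that answered 302, aborts on anything else. */
    static List<String> selectCrawlable(Map<String, Integer> statusByUrl) {
        List<String> crawlable = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : statusByUrl.entrySet()) {
            int status = entry.getValue();
            if (status == 200) {
                crawlable.add(entry.getKey());
            } else if (status == 302) {
                // Warn and move on instead of failing the whole job.
                System.err.println("WARN: skipping " + entry.getKey() + " (302 redirect)");
            } else {
                throw new IllegalStateException(
                    "Unexpected http error code " + status + " accessing " + entry.getKey());
            }
        }
        return crawlable;
    }

    public static void main(String[] args) {
        Map<String, Integer> observed = new LinkedHashMap<>();
        observed.put("http://sp.example.com/team", 200);
        observed.put("http://sp.example.com/partner", 302);
        observed.put("http://sp.example.com/docs", 200);
        System.out.println(selectCrawlable(observed));
    }
}
```

A policy like this only helps when the 302 is confined to a few entries; if the root site itself answers 302, nothing is left to crawl.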

Thanks,
Karl


On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Phil,

It is not surprising that the connector doesn't like 302 responses and doesn't know what to do with them, because it isn't supposed to ever be getting any of these.

I am puzzled by your statement that "only a couple of documents have redirections in them", because the connector crawls Lists and Library documents within SharePoint *only*, and these are very specifically accessible through a SharePoint URL hierarchy structure. There's no room in any of that for a 302 redirection. Since you see a 302 in the UI, I feel pretty certain you have a problem with your configuration and it is not just "a couple of documents".

Karl


On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <priethmuller@funnelback.com> wrote:
Thanks Karl,

The majority of content is not going to the redirect; it’s probably just a handful of documents that are behaving this way.

I’d agree that it’s of lesser concern whether or not the document itself is indexed; however, I wouldn’t expect the 302 to be treated as a fatal error that causes the job to come to a halt. I’d expect the document to be passed over, and the crawl to continue.

Is the only solution at this point to remove the documents which return a 302, to get the crawl to run in full?

Regards,

Phil Riethmuller
Technical Consultant

Funnelback | 437 Kent Street, Sydney, NSW 2000
AUSTRALIA | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES

Connect with us: LinkedIn - Twitter


From: Karl Wright <daddywri@gmail.com>
Reply-To: <user@manifoldcf.apache.org>
Date: Wednesday, 17 February 2016 8:58 am

To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: HTTP 302 error causing job to abort

Hi Phil,

You probably want to point your SharePoint repository connection to the proper server and site, and not rely on redirections. It's also possible that you are missing the site entirely and the redirection you are seeing is taking you to some error page somewhere.

I will be raising the question of redirections with the HttpComponents/HttpClient team, since I see no obvious problems with the SharePoint connector code. However, if your connection is properly set up, redirections should be unneeded.

I would read the documentation on the Wiki page for debugging SharePoint connections at the bottom of this page: https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections

Thanks,
Karl


On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <priethmuller@funnelback.com> wrote:
Do you mean in the job status in the Manifold CF interface?

The job status also shows the same:
Error: Unexpected http error code 302 accessing SharePoint at <url>: (302)HTTP/1.0 302 Found

I agree, I wouldn’t have thought that the crawler would follow any links or redirections.

Which settings could be misconfigured that I could look at revising?

Phil

From: Karl Wright <daddywri@gmail.com>
Reply-To: <user@manifoldcf.apache.org>
Date: Wednesday, 17 February 2016 8:45 am

To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: HTTP 302 error causing job to abort

Thanks.

When you view the repository connection in the UI, do you get a 302 error also?

I have looked at the code; HttpClient is supposedly configured to honor redirections. Obviously it is not doing that, so I'll have to dig deeper into why that is. On the other hand, I would not expect you to be getting any redirections, unless you have configured your connection incorrectly.

Karl


On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <priethmuller@funnelback.com> wrote:
Thanks Karl -
I’ve replaced the actual URL with <URL> below, but here is the stack trace:

ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
Caused by: (302)HTTP/1.0 302 Found
        at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
        at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
        at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
        at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
        at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
        at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
        at org.apache.axis.client.Call.invoke(Call.java:2767)
        at org.apache.axis.client.Call.invoke(Call.java:2443)
        at org.apache.axis.client.Call.invoke(Call.java:2366)
        at org.apache.axis.client.Call.invoke(Call.java:1812)
        at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
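
For tallying such failures across a large log, the status code and location can be pulled out of the message with a regular expression. A minimal Java sketch, assuming the message format is exactly as in the trace above; the class name `Http302LogParser` and the example.com URL are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper; assumes the message wording shown in the trace.
final class Http302LogParser {

    private static final Pattern ERROR = Pattern.compile(
        "Unexpected http error code (\\d{3}) accessing SharePoint at (.*?): \\(");

    /** Returns {statusCode, location}, or null when the line does not match. */
    static String[] parse(String logLine) {
        Matcher matcher = ERROR.matcher(logLine);
        return matcher.find() ? new String[] { matcher.group(1), matcher.group(2) } : null;
    }

    public static void main(String[] args) {
        // Invented location; the real URL was redacted in the thread.
        String line = "ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: "
            + "Unexpected http error code 302 accessing SharePoint at "
            + "http://sp.example.com/site: (302)HTTP/1.0 302 Found";
        String[] hit = parse(line);
        System.out.println(hit == null ? "no match" : hit[0] + " at " + hit[1]);
    }
}
```

Counting distinct locations this way gives a quick indication of whether the 302 is limited to a few subsites or affects the whole hierarchy.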




Regards,

Phil Riethmuller
Technical Consultant

Connect with us: LinkedIn - Twitter


From: Karl Wright <daddywri@gmail.com>
Reply-To: <user@manifoldcf.apache.org>
Date: Tuesday, 16 February 2016 6:54 pm
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: HTTP 302 error causing job to abort

Hi Phil,

An HTTP 302 response is simply a redirection. It should not, by itself, cause a job to abort. I would expect that to go by in wire/http logging, but you should not see it anywhere else. So it is not clear to me what you are really seeing here.

Can you include an example stack trace from the manifoldcf log?

Karl

On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <priethmuller@funnelback.com> wrote:
Hi -

When crawling a Sharepoint repository, I’m receiving an HTTP 302 error which is causing the manifold job to abort. How do I prevent the crawler from aborting the job?

I’m using v2.3 of Manifold with a postgres database.

Regards,
Phil

