manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: HTTP 302 error causing job to abort
Date Wed, 17 Feb 2016 23:31:28 GMT
Hi Phil,

The 302 error is not coming from a single document.  If it *was* coming
from the fetch of an individual document, it would be easy to work around.
But, from your stack trace, it is clear that this error is coming from an
API call, specifically a call to enumerate subsites of a given site.  That
means that some or all of the SharePoint hierarchy is not accessible
through POST requests.  I have never seen this kind of behavior from
SharePoint before.

This is not something that I can work around without more information.  In
order to get that information, you will at the very minimum need to turn on
connector debugging, and probably turning on http wire debugging would be
helpful too.  And, if what you said about the View page for this connection
is true and it also shows a 302 error, I very much suspect that something
changed on the server end and you are currently unable to crawl *any*
documents at all.

I am sorry I cannot make this any clearer.

Thanks,
Karl




On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller <
priethmuller@funnelback.com> wrote:

> Hi Karl,
>
> Thanks for the update.
>
> I’m not 100% sure how many documents have this redirect in them, but I’ll
> see if I can get a better estimate. The content we are crawling is very
> large and comes from many different authors, so it’s difficult to manage
> how these SharePoint documents are created. That makes it extremely
> difficult to pinpoint all the documents that contain redirects.
>
> Am I correct in assuming a single 302 error causes the job to fail, or is
> there some other logic that determines this?
>
> How plausible would it be to include in the product an option for treating
> 302s as a warning, rather than a fatal error? Possibly just an option in
> the Job setup?
>
> Regards,
> Phil
>
>
> From: Karl Wright <daddywri@gmail.com>
> Reply-To: <user@manifoldcf.apache.org>
> Date: Thursday, 18 February 2016 1:39 am
>
> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
> Subject: Re: HTTP 302 error causing job to abort
>
> Hi again Phil,
>
> The HttpClient team points out that POST requests (as we do for the
> SharePoint repository requests) are not allowed to follow 302 redirections
> according to RFC2616.  We use POST requests because, for SOAP, there is
> often quite a bit of XML data that goes along with the request, and we
> would otherwise have size issues.  So we cannot use GET instead of POST.
> See CONNECTORS-1279 for details.
>
> If you still believe that it is only a couple of URLs that are returning
> 302 for you, I'd like some analysis of why you believe that to be true.  I
> would be happy to consider recognition of an occasional 302 response as
> meaning "skip this document".  On the other hand, based on your stack
> trace, it really appears that you have a far more systemic problem; it is
> failing while obtaining information for an entire site, so not much would
> get crawled in that case.
>
> Thanks,
> Karl
>
>
> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Phil,
>>
>> It is not surprising that the connector doesn't like 302 responses and
>> doesn't know what to do with them, because it isn't supposed to ever be
>> getting any of these.
>>
>> I am puzzled by your statement that "only a couple of documents have
>> redirections in them", because the connector crawls Lists and Library
>> documents within SharePoint *only*, and these are very specifically
>> accessible through a SharePoint URL hierarchy structure.  There's no room
>> in any of that for a 302 redirection.  Since you see a 302 in the UI, I
>> feel pretty certain you have a problem with your configuration and it is
>> not just "a couple of documents".
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <
>> priethmuller@funnelback.com> wrote:
>>
>>> Thanks Karl,
>>>
>>> The majority of content is not going to the redirect, it’s probably just
>>> a handful of documents that are behaving this way.
>>>
>>> I’d agree that it’s of lesser concern whether or not the document itself
>>> is indexing, however I wouldn’t expect the 302 to be treated as a fatal
>>> error that causes the job to come to a halt. I’d expect the document to be
>>> passed over, and the crawl to continue.
>>>
>>> Is the only solution at this point to remove the documents that
>>> return a 302, so that the crawl can run in full?
>>>
>>> Regards,
>>>
>>> *Phil Riethmuller*
>>> Technical Consultant
>>>
>>> *Funnelback |* 437 Kent Street, Sydney, NSW 2000
>>> *T* +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
>>>
>>> *AUSTRALIA* | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
>>>
>>> Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>
>>>  - *Twitter*
>>>
>>>
>>> From: Karl Wright <daddywri@gmail.com>
>>> Reply-To: <user@manifoldcf.apache.org>
>>> Date: Wednesday, 17 February 2016 8:58 am
>>>
>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>> Subject: Re: HTTP 302 error causing job to abort
>>>
>>> Hi Phil,
>>>
>>> You probably want to point your SharePoint repository connection to the
>>> proper server and site, and not rely on redirections.  It's also possible
>>> that you are missing the site entirely and the redirection you are seeing
>>> is taking you to some error page somewhere.
>>>
>>> I will be raising the question of redirections with the
>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>> SharePoint connector code.  However, if your connection is properly set up,
>>> redirections should be unneeded.
>>>
>>> I would read the documentation on the Wiki page for debugging SharePoint
>>> connections at the bottom of this page:
>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <
>>> priethmuller@funnelback.com> wrote:
>>>
>>>> Do you mean in the job status in the Manifold CF interface?
>>>>
>>>> The job status also shows the same:
>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>> (302)HTTP/1.0 302 Found
>>>>
>>>> I agree, I wouldn’t have thought that the crawler would follow any links
>>>> or redirections.
>>>>
>>>> Which connection settings might be misconfigured, so that I can look at
>>>> revising them?
>>>>
>>>> Phil
>>>>
>>>>
>>>> From: Karl Wright <daddywri@gmail.com>
>>>> Reply-To: <user@manifoldcf.apache.org>
>>>> Date: Wednesday, 17 February 2016 8:45 am
>>>>
>>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>
>>>> Thanks.
>>>>
>>>> When you view the repository connection in the UI, do you get a 302
>>>> error also?
>>>>
>>>> I have looked at the code; HttpClient is supposedly configured to honor
>>>> redirections.  Obviously it is not doing that, so I'll have to dig deeper
>>>> into why that is.  On the other hand, I would not expect you to be getting
>>>> any redirections, unless you have configured your connection incorrectly.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
>>>> priethmuller@funnelback.com> wrote:
>>>>
>>>>> Thanks Karl -
>>>>>
>>>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>>>> trace:
>>>>>
>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed:
>>>>> Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0
>>>>> 302 Found
>>>>>
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
>>>>> http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0
>>>>> 302 Found
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>
>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>>
>>>>>         at
>>>>> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>>
>>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>>
>>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>>
>>>>>         at
>>>>> org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>>
>>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>>
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>>
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>>
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>>
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>>
>>>>>         at
>>>>> com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> *Phil Riethmuller*
>>>>> Technical Consultant
>>>>>
>>>>> *Funnelback |* 437 Kent Street, Sydney, NSW 2000
>>>>> *T* +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
>>>>>
>>>>> *AUSTRALIA* | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
>>>>>
>>>>> Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>
>>>>>  - *Twitter*
>>>>>
>>>>>
>>>>> From: Karl Wright <daddywri@gmail.com>
>>>>> Reply-To: <user@manifoldcf.apache.org>
>>>>> Date: Tuesday, 16 February 2016 6:54 pm
>>>>> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>
>>>>> Hi Phil,
>>>>>
>>>>> An HTTP 302 response is simply a redirection.  It should not, by
>>>>> itself, cause a job to abort.  I would expect that to go by in wire/http
>>>>> logging, but you should not see it anywhere else.  So it is not clear to
>>>>> me what you are really seeing here.
>>>>>
>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
>>>>> priethmuller@funnelback.com> wrote:
>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> When crawling a SharePoint repository, I’m receiving an HTTP 302 error
>>>>>> which is causing the manifold job to abort. How do I prevent the
>>>>>> crawler from aborting the job?
>>>>>>
>>>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>>>
>>>>>> Regards,
>>>>>> Phil
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
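
The redirect rule raised in this thread (a POST request must not automatically follow a 302 under RFC 2616) can be sketched in Java roughly as follows. This is an illustrative sketch only: the class name RedirectRules and the method mayFollowAutomatically are hypothetical and are not part of ManifoldCF or HttpClient, whose actual redirect handling may differ in detail.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not ManifoldCF or HttpClient code) that encodes the
 * RFC 2616 section 10.3 rule for following redirects without user confirmation.
 */
final class RedirectRules {
    private RedirectRules() {}

    /**
     * Returns true if a client may follow the redirect automatically.
     * For 301, 302 and 307 this is only allowed when the original request
     * was GET or HEAD; a 303 response is retrieved with a GET regardless
     * of the original method.
     */
    static boolean mayFollowAutomatically(String method, int status) {
        String m = method.toUpperCase(Locale.ROOT);
        boolean getOrHead = m.equals("GET") || m.equals("HEAD");
        switch (status) {
            case 301:
            case 302:
            case 307:
                return getOrHead;
            case 303:
                return true;
            default:
                // Not a redirect status covered by this sketch.
                return false;
        }
    }

    public static void main(String[] args) {
        // A SOAP call is a POST, so a 302 reply is surfaced as an error.
        System.out.println("POST 302 -> " + mayFollowAutomatically("POST", 302));
        // A plain document fetch is a GET, so the same 302 could be followed.
        System.out.println("GET 302 -> " + mayFollowAutomatically("GET", 302));
    }
}
```

Under this rule a SOAP call such as the subsite enumeration in the stack trace, which is sent as a POST, has its 302 reported to the caller rather than followed, whereas a GET that received the same response could be redirected.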
