manifoldcf-user mailing list archives

From Karl Wright
Subject Re: Getting a 401 Unauthorized on a SharePoint 2010 crawl request, with MCPermissions.asmx installed
Date Mon, 16 Sep 2013 15:08:27 GMT
Hi Dmitry,

I've looked at the log in detail.  Unfortunately it is not complete enough
for me to be able to understand where exactly the logic is not matching
SharePoint AWS's responses.

Could you please do the following:

(a) Make sure the connection has "4.0 AWS" selected as the type.
(b) Turn on sufficient debugging so I can see all SOAP transactions.
HttpClient wire debugging is sufficient; there may be a less verbose
setting that will also do.  You will need to do this in the
logging.ini file.  There's a fair bit of online documentation on how to do
this.  While you are at it, delete the current log.
(c) DELETE the job you have been using, and recreate it, making sure to use
the same wildcards as before.  Otherwise it will be too confusing.
(d) Wait for the old job to clean up, then run the new job for a short
while (say, 1 minute).  Abort the job and send me the log.
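
For step (b), a minimal sketch of what the wire-logging lines might look like. It assumes logging.ini is read as a log4j 1.x properties file and that the connector goes through Apache HttpComponents HttpClient 4.x; the logger names are HttpClient's own, not anything ManifoldCF-specific. On the older Commons HttpClient 3.x the wire logger would be `httpclient.wire` instead.

```
# Assumption: log4j 1.x properties syntax, HttpClient 4.x logger names.
# Logs every byte sent and received, including the SOAP envelopes.
log4j.logger.org.apache.http.wire=DEBUG
# Less verbose alternative: request and response headers only.
log4j.logger.org.apache.http.headers=DEBUG
```

Wire logging is very verbose, which is one more reason to keep the test run short.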

There could be a number of issues - for example, it may well be the case
that in some situations the path actually is prepended to the results from
GetList, and in other situations not.  I will have to look at all the calls
before I can make any determination where the issues lie.
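
To illustrate the kind of inconsistency described above, here is a rough Java sketch of tolerating both cases: a returned path that already carries the site path as a prefix, and one that does not. This is not the connector's code; `PathNormalizer` and `stripSitePrefix` are hypothetical names invented for the example, and the sample paths reuse the anonymized names from this thread.

```java
// Hypothetical helper, not part of ManifoldCF: makes a returned path
// relative to its site whether or not the site path was prepended.
class PathNormalizer {

    // Strips sitePath from returnedPath only when it is present as a
    // leading prefix that ends on a path-segment boundary.
    static String stripSitePrefix(String sitePath, String returnedPath) {
        String site = trimSlashes(sitePath);
        String path = trimSlashes(returnedPath);
        if (site.isEmpty()) {
            return path;                      // root site: nothing to strip
        }
        if (path.equals(site)) {
            return "";                        // the path is the site itself
        }
        if (path.startsWith(site + "/")) {
            return path.substring(site.length() + 1);
        }
        return path;                          // prefix absent: leave as-is
    }

    // Removes leading and trailing '/' characters.
    static String trimSlashes(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) == '/') {
            start++;
        }
        while (end > start && s.charAt(end - 1) == '/') {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Prefix present and prefix absent both yield the same relative path.
        System.out.println(stripSitePrefix("/Abcd", "/Abcd/Announcements"));
        System.out.println(stripSitePrefix("/Abcd", "Announcements"));
    }
}
```

Matching on a segment boundary matters here: a site named Abcd must not be stripped from a path that merely starts with the same characters.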


On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright wrote:

> Hi Dmitry,
> Thanks for the additional information.  It does appear like the method
> that lists subsites is not working as expected under AWS.  Nor are some
> number of other methods which supposedly just list the children of a
> subsite.
> I've reopened CONNECTORS-772 to work on addressing this issue.  Please
> stay tuned.
> Karl
> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg wrote:
>> Hi Karl,
>> Most of the paths that get generated are listed in the attached log; they
>> match what shows up in the diag report. So I'm not sure where they diverge;
>> most of them just don't seem right.  There are 3 subsites rooted in the
>> main site: Abcd, Defghij, Klmnopqr.  It's strange that the connector would
>> try such paths as:
>> /Klmnopqr/Defghij/Defghij/Announcements/// -- there are multiple
>> repetitions of the same subsite on the path and, to begin with, Defghij is
>> not a subsite of Klmnopqr, so why would it try this? The /// at the end
>> doesn't seem correct either, unless I'm missing something in how this
>> pathing works.
>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements --
>> looks wrong. A docname is mixed into the path, a subsite ends up after a
>> docname?...
>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same
>> types of issues plus now somehow the docname got split with a forward
>> slash?..
>> There are also a bunch of StringIndexOutOfBoundsExceptions.  Perhaps
>> this logic doesn't fit with the pathing we're seeing on this AWS-based
>> installation?
>> I'd expect the logic to just know that root contains 3 subsites, and work
>> off that. Each subsite has a specific list of libraries and lists, etc. It
>> seems odd that the connector gets into this matching pattern, and tries
>> what looks like thousands of variations (I aborted the execution).
>> - Dmitry
>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright wrote:
>>> Hi Dmitry,
>>> To clarify, the way you would need to analyze this is to run a crawl
>>> with the wildcards as you have selected, abort if necessary after a while,
>>> and then use the Document Status report to list the document identifiers
>>> that had been generated.  Find a document identifier that you believe
>>> represents a path that is illegal, and figure out what SOAP getChild call
>>> caused the problem by returning incorrect data.  In other words, find the
>>> point in the path where the path diverges from what exists into what
>>> doesn't exist, and go back in the ManifoldCF logs to find the particular
>>> SOAP request that led to the issue.
>>> I'd expect from your description that the problem lies with getting
>>> child sites given a site path, but that's just a guess at this point.
>>> Karl
>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright wrote:
>>>> Hi Dmitry,
>>>> I don't understand what you mean by "I've tried the set of wildcards as
>>>> below and I seem to be running into a lot of cycles, where various subsite
>>>> folders are appended to each other and an extraction of data at all of
>>>> those locations is attempted".   If you are seeing cycles it means that
>>>> document discovery is still failing in some way.  For each
>>>> folder/library/site/subsite, only the children of that
>>>> folder/library/site/subsite should be appended to the path - ever.
>>>> If you can give a specific example, preferably including the soap
>>>> back-and-forth, that would be very helpful.
>>>> Karl
>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg wrote:
>>>>> Hi Karl,
>>>>> Quick question. Is there an easy way to configure an SP repo
>>>>> connection for crawling of all content, from the root site all the way
>>>>> down? I've tried the set of wildcards as below and I seem to be running
>>>>> into a lot of cycles, where various subsite folders are appended to each
>>>>> other and an extraction of data at all of those locations is attempted.
>>>>> Ideally I'd like to avoid having to construct an exact set of paths
>>>>> because they may change, especially with new content being added.
>>>>> Path rules:
>>>>> /* file include
>>>>> /* library include
>>>>> /* list include
>>>>> /* site include
>>>>> Metadata:
>>>>> /* include true
>>>>> I'd also like to pull down any files attached to list items. I'm
>>>>> hoping that some type of "/* file include" should do it, once I figure
>>>>> out how to safely include all content.
>>>>> Thanks,
>>>>> - Dmitry
