manifoldcf-user mailing list archives

From Will Parkinson <parkinson.w...@gmail.com>
Subject Re: Getting a 401 Unauthorized on a SharePoint 2010 crawl request, with MCPermissions.asmx installed
Date Wed, 18 Sep 2013 12:53:16 GMT
Hi Dmitry,

Just out of interest, what does the following command output on your system?

cd to C:\inetpub\adminscripts

cscript adsutil.vbs get w3svc/<put your SharePoint website number here>/root/NTAuthenticationProviders
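
For reference, on a typical setup the output looks something like this (an
assumption -- the exact value depends on how authentication is configured
for the site):

NTAuthenticationProviders       : (STRING) "Negotiate,NTLM"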

Cheers,

Will


On Wed, Sep 18, 2013 at 10:44 PM, Karl Wright <daddywri@gmail.com> wrote:

> "This is the second time I'm encountering the issue which leads me to
> believe it's a quirk of IIS and/or SharePoint."
>
> It cannot be just a quirk of SharePoint, because SharePoint's UI etc. could
> not create or work with subsites if that were true.  It may well be a
> configuration issue with IIS, which is indeed what I suspect.  I have
> pinged all the resources I know of to try to get some insight as to why
> this is happening.
>
>
> "Perhaps this is something that can be worked into the 'fabric' of
> ManifoldCF as a workaround for a known issue."
>
> Like I said before, this is a huge amount of work, tantamount to rewriting
> most of the connector.  If this is what you want to request, that is your
> option, but there is no way we'd complete any of this work before
> December/January at the earliest.
>
>
> "Just to understand this a bit better, the main breakage here is that the
> wildcards don't work properly, right? "
>
> No, it means that ManifoldCF cannot get at any data of any kind associated
> with a SharePoint subsite.  Accessing root data works fine.  If you try to
> crawl as things are now, you must disable all subsites and just crawl the
> root site, or you will crawl the same things with longer and longer paths
> indefinitely.
>
> Karl
>
>
>
>
>
> On Wed, Sep 18, 2013 at 8:38 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>
>> Karl,
>>
>> This is the second time I'm encountering the issue, which leads me to
>> believe it's a quirk of IIS and/or SharePoint. Perhaps this is something
>> that can be worked into the 'fabric' of ManifoldCF as a workaround for a
>> known issue. I understand that it may have far-reaching tentacles, but I
>> wonder if that's really the only option...
>>
>> Just to understand this a bit better, the main breakage here is that the
>> wildcards don't work properly, right?  In theory, if I have a repo connector
>> config which lists specific library and list paths, things should work?
>> It's only when the /* types of wildcards are included that we're in trouble?
>>
>> - Dmitry
>>
>>
>> On Wed, Sep 18, 2013 at 8:07 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Someone else was having a similar problem. See
>>> http://social.technet.microsoft.com/Forums/sharepoint/en-US/e4b53c63-b89a-4356-a7b0-6ca7bfd22826/getting-sharepoint-subsite-from-custom-webservice.
>>>
>>> Apparently it does depend on how you get to the web service, which does
>>> argue that it is an IIS issue.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Sep 17, 2013 at 5:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Dmitry,
>>>>
>>>> As discussed privately, I had a look at your system.  What is happening
>>>> is that the C# static SPContext.Current.Web is not reflecting the subsite
>>>> in any URL that contains a subsite.  In other words, the URL coming in
>>>> might be "http://servername/subsite1/_vti_bin/MCPermissions.asmx", but
>>>> the MCPermissions.asmx plugin will think it is being executed in the root
>>>> context ("http://servername").  That's pretty broken behavior, so I'm
>>>> guessing that either IIS or SharePoint is somehow misconfigured to do
>>>> this, and that if the misconfiguration were corrected, the web services
>>>> would then begin to work right again.  But I have no idea how this should
>>>> actually be fixed.
>>>>
>>>> Will Parkinson, one of the subscribers to this list, may find the
>>>> symptoms meaningful, since he has set up an AWS SharePoint instance before.
>>>> I hope he will respond in a helpful way.  Until then, I think we are stuck.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Tue, Sep 17, 2013 at 9:49 AM, Dmitry Goldenberg <
>>>> dgoldenberg@kmwllc.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> It looks like I'll be able to get access for you to the test system
>>>>> we're using. Would you be interested in working with the system directly? I
>>>>> certainly don't mind doing some testing but I thought we'd speed things up
>>>>> this way. If so, could you email me from a more private account so we can
>>>>> set this up?
>>>>>
>>>>> Thanks,
>>>>> - Dmitry
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 17, 2013 at 7:38 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> Another interesting bit from the log:
>>>>>>
>>>>>> >>>>>>
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/_catalogs/lt/Forms/AllItems.aspx', 'List Template Gallery'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/_catalogs/masterpage/Forms/AllItems.aspx', 'Master Page
>>>>>> Gallery'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/Shared Documents/Forms/AllItems.aspx', 'Shared Documents'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/SiteAssets/Forms/AllItems.aspx', 'Site Assets'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/SitePages/Forms/AllPages.aspx', 'Site Pages'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/_catalogs/solutions/Forms/AllItems.aspx', 'Solution Gallery'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/Style Library/Forms/AllItems.aspx', 'Style Library'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/Test Library 1/Forms/AllItems.aspx', 'Test Library 1'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/_catalogs/theme/Forms/AllItems.aspx', 'Theme Gallery'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library list: '/_catalogs/wp/Forms/AllItems.aspx', 'Web Part Gallery'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Checking whether to include library
>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents' exactly
>>>>>> matched rule path '/*'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Checking whether to include library
>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets' exactly
>>>>>> matched rule path '/*'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Checking whether to include library
>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages' exactly matched
>>>>>> rule path '/*'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Checking whether to include library
>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library' exactly
>>>>>> matched rule path '/*'
>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint:
>>>>>> Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>> <<<<<<
>>>>>>
>>>>>> This time it appears that it is the Lists service that is broken and
>>>>>> does not recognize the parent site.
>>>>>>
>>>>>> I haven't corrected this problem yet since now I am beginning to
>>>>>> wonder if *any* of the web services under Amazon work at all for subsites.
>>>>>> We may be better off implementing everything we need in the MCPermissions
>>>>>> service.  I will ponder this as I continue to research the logs.
>>>>>>
>>>>>> It's still valuable to check my getSites() implementation.  I'll be
>>>>>> doing another round of work tonight on the plugin.
>>>>>>
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 16, 2013 at 8:45 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>
>>>>>>> The augmented plugin can be downloaded from
>>>>>>> http://people.apache.org/~kwright/MetaCarta.SharePoint.MCPermissionsService.wsp.  The revised connector code is also ready, and should be checked out and
>>>>>>> built from
>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-772.
>>>>>>>
>>>>>>> Once you set it all up, you can see if it is doing the right thing
>>>>>>> by just trying to drill down through subsites in the UI.  You should always
>>>>>>> see a list of subsites that is appropriate for the context you are in; if
>>>>>>> this does not happen it is not working.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 16, 2013 at 7:45 PM, Dmitry Goldenberg <
>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>
>>>>>>>> Karl,
>>>>>>>>
>>>>>>>> I can see how preloading the list of subsites may be less optimal...
>>>>>>>> The advantage of doing it this way is that one call gets you the whole
>>>>>>>> structure in memory, which may be OK unless there are sites with a ton
>>>>>>>> of subsites, which may stress memory. The disadvantage is having to
>>>>>>>> pass this structure around...
>>>>>>>>
>>>>>>>> Yes, I'll certainly help test out your changes, just let me know
>>>>>>>> when they're available.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> - Dmitry
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 16, 2013 at 7:19 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Dmitry,
>>>>>>>>>
>>>>>>>>> Thanks for the code snippet.  I'd prefer, though, to not preload
>>>>>>>>> the entire site structure in memory.  Probably it would be better to just
>>>>>>>>> add another method to the ManifoldCF SharePoint 2010 plugin.  More methods
>>>>>>>>> are going to be added anyway to support Claim Space Authentication, so I
>>>>>>>>> guess this would be just one more.
>>>>>>>>>
>>>>>>>>> We honestly have never seen this problem before - so it's not just
>>>>>>>>> flakiness, it has something to do with the installation, I'm certain.  At
>>>>>>>>> any rate, I'll get going right away on a workaround - if you are willing to
>>>>>>>>> test what I produce.  I'm also certain there is at least one other issue,
>>>>>>>>> but hopefully that will become clearer once this one is resolved.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 16, 2013 at 6:49 PM, Dmitry Goldenberg <
>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>
>>>>>>>>>> Karl,
>>>>>>>>>>
>>>>>>>>>> >> subsite discovery is effectively disabled except directly
>>>>>>>>>> under the root site
>>>>>>>>>>
>>>>>>>>>> Yes. Come to think of it, I once came across this problem while
>>>>>>>>>> implementing a SharePoint connector.  I'm not sure whether it's exactly
>>>>>>>>>> what's happening with the issue we're discussing, but it looks like it.
>>>>>>>>>>
>>>>>>>>>> I started off by using multiple getWebCollection calls to get
>>>>>>>>>> child subsites of sites and trying to navigate down that way. The problem
>>>>>>>>>> was that getWebCollection was always returning the immediate subsites of
>>>>>>>>>> the root site no matter whether you're at the root or below, so I ended up
>>>>>>>>>> generating infinite loops.
>>>>>>>>>>
>>>>>>>>>> I switched over to using a single getAllSubWebCollection call and
>>>>>>>>>> caching its results. That call returns the full list of all subsites as
>>>>>>>>>> pairs of Title and Url.  I had a POJO similar to the one below which held
>>>>>>>>>> the list of sites and contained logic for enumerating the child sites,
>>>>>>>>>> given the URL of a (parent) site.  From what I recall, getWebCollection
>>>>>>>>>> works inconsistently, either across SP versions or across installations,
>>>>>>>>>> but the logic below should work in any case.
>>>>>>>>>>
>>>>>>>>>> *** public class SubSiteCollection -- holds a list of CrawledSite
>>>>>>>>>> POJOs, each of which is a { title, url }.
>>>>>>>>>>
>>>>>>>>>> *** SubSiteCollection has the following:
>>>>>>>>>>
>>>>>>>>>>  // Returns the subsites whose URLs sit exactly one path segment
>>>>>>>>>>  // below the given site URL.
>>>>>>>>>>  public List<CrawledSite> getImmediateSubSites(String siteUrl) {
>>>>>>>>>>    List<CrawledSite> subSites = new ArrayList<CrawledSite>();
>>>>>>>>>>    for (CrawledSite site : sites) {
>>>>>>>>>>      if (isChildOf(siteUrl, site.getUrl().toString())) {
>>>>>>>>>>        subSites.add(site);
>>>>>>>>>>      }
>>>>>>>>>>    }
>>>>>>>>>>    return subSites;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  // True if urlToCheck is an immediate child of parentUrl: after
>>>>>>>>>>  // stripping the parent prefix, the remainder must contain exactly
>>>>>>>>>>  // one slash.  countOccurrencesOf here is presumably Spring's
>>>>>>>>>>  // org.springframework.util.StringUtils.
>>>>>>>>>>  private static boolean isChildOf(String parentUrl, String urlToCheck) {
>>>>>>>>>>    final String parent = normalizeUrl(parentUrl);
>>>>>>>>>>    final String child = normalizeUrl(urlToCheck);
>>>>>>>>>>    boolean ret = false;
>>>>>>>>>>    if (child.startsWith(parent)) {
>>>>>>>>>>      String remainder = child.substring(parent.length());
>>>>>>>>>>      ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
>>>>>>>>>>    }
>>>>>>>>>>    return ret;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  // Lowercases and guarantees a trailing slash so that prefix
>>>>>>>>>>  // comparisons between URLs are consistent.
>>>>>>>>>>  private static String normalizeUrl(String url) {
>>>>>>>>>>    return ((url.endsWith(SLASH)) ? url : url + SLASH).toLowerCase();
>>>>>>>>>>  }
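>>>>>>>>>>
>>>>>>>>>> For illustration, a minimal hypothetical usage sketch (the site URLs
>>>>>>>>>> and the add() method are invented for the example):
>>>>>>>>>>
>>>>>>>>>>  SubSiteCollection coll = new SubSiteCollection();
>>>>>>>>>>  coll.add(new CrawledSite("Abcd", "http://server/Abcd"));
>>>>>>>>>>  coll.add(new CrawledSite("Defghij", "http://server/Defghij"));
>>>>>>>>>>  coll.add(new CrawledSite("Nested", "http://server/Abcd/Nested"));
>>>>>>>>>>  // Returns Abcd and Defghij but not Abcd/Nested, whose remainder
>>>>>>>>>>  // past the root contains two slashes after normalization.
>>>>>>>>>>  List<CrawledSite> rootChildren =
>>>>>>>>>>      coll.getImmediateSubSites("http://server");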
>>>>>>>>>>
>>>>>>>>>> - Dmitry
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>
>>>>>>>>>>> Have a look at this sequence also:
>>>>>>>>>>>
>>>>>>>>>>> >>>>>>
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Subsite list: '
>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd', 'Abcd'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Subsite list: '
>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij',
>>>>>>>>>>> 'Defghij'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Subsite list: '
>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr',
>>>>>>>>>>> 'Klmnopqr'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched rule path '/*'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly matched rule path '/*'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly matched rule path '/*'
>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>>>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>>>>>>>
>>>>>>>>>>> <<<<<<
>>>>>>>>>>>
>>>>>>>>>>> This is using the GetSites(String parent) method with a site
>>>>>>>>>>> name of "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back three sites (!!).
>>>>>>>>>>> The parent path is not correct, obviously, but nevertheless this is one
>>>>>>>>>>> way in which paths are getting completely messed up.  It *looks* like the
>>>>>>>>>>> Webs web service is broken in such a way as to ignore the URL coming in,
>>>>>>>>>>> except for the base part, which means that subsite discovery is effectively
>>>>>>>>>>> disabled except directly under the root site.
>>>>>>>>>>>
>>>>>>>>>>> This might still be OK if it is not possible to create subsites
>>>>>>>>>>> of subsites in this version of SharePoint.  Can you confirm that this is or
>>>>>>>>>>> is not possible?
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 16, 2013 at 2:42 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> "This is everything that got generated, from the very beginning"
>>>>>>>>>>>>
>>>>>>>>>>>> Well, something isn't right.  What I expect to see right up front,
>>>>>>>>>>>> but don't, are:
>>>>>>>>>>>>
>>>>>>>>>>>> - A webs "getWebCollection" invocation for /_vti_bin/webs.asmx
>>>>>>>>>>>> - Two lists "getListCollection" invocations for /_vti_bin/lists.asmx
>>>>>>>>>>>>
>>>>>>>>>>>> Instead, the first transactions I see are from already-busted
>>>>>>>>>>>> URLs - which makes no sense, since there would be no way for them to
>>>>>>>>>>>> have been queued yet.
>>>>>>>>>>>>
>>>>>>>>>>>> So there are a number of possibilities.  First, maybe the log
>>>>>>>>>>>> isn't getting cleared out, and the session in question therefore starts
>>>>>>>>>>>> somewhere in the middle of manifoldcf.log.1.  But no:
>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>> C:\logs>grep "POST /_vti_bin/webs" manifoldcf.log.1
>>>>>>>>>>>> grep: input lines truncated - result questionable
>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>
>>>>>>>>>>>> Nevertheless there are some interesting points here.  First,
>>>>>>>>>>>> note the following response, which I've been able to determine is against
>>>>>>>>>>>> "Test Library 1":
>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>> SharePoint: getListItems xml response: '<GetListItems xmlns="
>>>>>>>>>>>> http://schemas.microsoft.com/sharepoint/soap/directory/"><GetListItemsResponse
>>>>>>>>>>>> xmlns=""><GetListItemsResult
>>>>>>>>>>>> FileRef="SitePages/Home.aspx"/></GetListItemsResponse></GetListItems>'
>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>> SharePoint: Checking whether to include document '/SitePages/Home.aspx'
>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>> SharePoint: File '/SitePages/Home.aspx' exactly matched rule path '/*'
>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>> SharePoint: Including file '/SitePages/Home.aspx'
>>>>>>>>>>>>  WARN 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>> Sharepoint: Unexpected relPath structure; path is '/SitePages/Home.aspx',
>>>>>>>>>>>> but expected <list/library> length of 26
>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>
>>>>>>>>>>>> The FileRef in this case is pointing at what, exactly?  Is
>>>>>>>>>>>> there a SitePages/Home.aspx in the "Test Library 1" library?  Or does it
>>>>>>>>>>>> mean to refer back to the root site with this URL construction?  And since
>>>>>>>>>>>> this is supposedly at the root level, how come the combined site + library
>>>>>>>>>>>> name comes out to 26??  I get 15 ('/Test Library 1' is 15 characters),
>>>>>>>>>>>> which leaves 11 characters unaccounted for.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm still looking at the logs to see if I can glean key
>>>>>>>>>>>> information.  Later, if I could set up a crawl against the SharePoint
>>>>>>>>>>>> instance in question, that would certainly help.  I can readily set up an
>>>>>>>>>>>> ssh tunnel if that is what is required.  But I won't be able to do it until
>>>>>>>>>>>> I get home tonight.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:58 PM, Dmitry Goldenberg <
>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is everything that got generated, from the very
>>>>>>>>>>>>> beginning, meaning that I did a fresh build, new database, and new connection
>>>>>>>>>>>>> definitions, then started. The log must have rolled, but the .1 log is included.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I were to get you access to the actual test system, would
>>>>>>>>>>>>> you mind taking a look? It may be more efficient than sending logs..
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> These logs are different but have exactly the same problem;
>>>>>>>>>>>>>> they start in the middle when the crawl is already well underway.  I'm
>>>>>>>>>>>>>> wondering if by chance you have more than one agents process running or
>>>>>>>>>>>>>> something?  Or maybe the log is rolling and stuff is getting lost?  What's
>>>>>>>>>>>>>> there is not what I would expect to see, at all.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I *did* manage to find two transactions that look like they
>>>>>>>>>>>>>> might be helpful, but because the *results* of those transactions are
>>>>>>>>>>>>>> required by transactions that take place minutes *before* in the log, I
>>>>>>>>>>>>>> have no confidence that I'm looking at anything meaningful.  But I'll get
>>>>>>>>>>>>>> back to you on what I find nonetheless.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you decide to repeat this exercise, try watching the log with
>>>>>>>>>>>>>> "tail -f" before starting the job.  You should not see any log contents at
>>>>>>>>>>>>>> all until the job is started.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <
>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Attached please find logs which start at the beginning. I
>>>>>>>>>>>>>>> started from a fresh build (clean db etc.); the logs start at server start,
>>>>>>>>>>>>>>> then I create the output connection and the repo connection, then the job,
>>>>>>>>>>>>>>> and then I fire off the job. I aborted the execution about a minute or so
>>>>>>>>>>>>>>> into it.  That's all that's in the logs with:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> org.apache.manifoldcf.connectors=DEBUG
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> log4j.logger.httpclient.wire.header=DEBUG
>>>>>>>>>>>>>>> log4j.logger.org.apache.commons.httpclient=DEBUG
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <
>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are you sure these are the right logs?
>>>>>>>>>>>>>>>> - They start right in the middle of a crawl
>>>>>>>>>>>>>>>> - They are already in a broken state when they start, e.g.
>>>>>>>>>>>>>>>> the kinds of things that are being looked up are already nonsense paths
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I need to see logs from the BEGINNING of a fresh crawl to
>>>>>>>>>>>>>>>> see how the nonsense paths happen.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <
>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've generated logs with details as we discussed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The job was created afresh, as before:
>>>>>>>>>>>>>>>>> Path rules:
>>>>>>>>>>>>>>>>> /* file include
>>>>>>>>>>>>>>>>> /* library include
>>>>>>>>>>>>>>>>> /* list include
>>>>>>>>>>>>>>>>> /* site include
>>>>>>>>>>>>>>>>> Metadata:
>>>>>>>>>>>>>>>>> /* include true
>>>>>>>>>>>>>>>>> The logs are attached.
>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <
>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "Do you think that this issue is generic with regard to
>>>>>>>>>>>>>>>>>> any Amz instance?"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I presume so, since you didn't apparently do anything
>>>>>>>>>>>>>>>>>> special to set one of these up.  Unfortunately, such instances are not part
>>>>>>>>>>>>>>>>>> of the free tier, so I am still constrained from setting one up for myself
>>>>>>>>>>>>>>>>>> because of household rules here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "For now, I assume our only workaround is to list the
>>>>>>>>>>>>>>>>>> paths of interest manually"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Depending on what is going wrong, that may not even
>>>>>>>>>>>>>>>>>> work.  For this to happen, it looks like several SharePoint web service
>>>>>>>>>>>>>>>>>> calls must be affected, and not in a cleanly predictable way.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "is identification and extraction of attachments
>>>>>>>>>>>>>>>>>> supported in the SP connector?"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ManifoldCF in general leaves identification and
>>>>>>>>>>>>>>>>>> extraction to the search engine.  Solr, for instance, uses Tika for this,
>>>>>>>>>>>>>>>>>> if so configured.  You can configure your Solr output connection to include
>>>>>>>>>>>>>>>>>> or exclude specific mime types or extensions if you want to limit what is
>>>>>>>>>>>>>>>>>> attempted.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <
>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks, Karl. Do you think that this issue is generic
>>>>>>>>>>>>>>>>>>> with regard to any Amz instance? I'm just wondering how easily reproducible
>>>>>>>>>>>>>>>>>>> this may be..
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For now, I assume our only workaround is to list the
>>>>>>>>>>>>>>>>>>> paths of interest manually, i.e. add explicit rules for each library and
>>>>>>>>>>>>>>>>>>> list.
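>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hypothetically, with our Abcd subsite that would look
>>>>>>>>>>>>>>>>>>> something like (paths invented for the example):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> /Abcd/Shared Documents library include
>>>>>>>>>>>>>>>>>>> /Abcd/Announcements list include
>>>>>>>>>>>>>>>>>>> /Abcd/Shared Documents/* file include
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> and so on for each subsite, library and list.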
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A related subject - is identification and extraction of
>>>>>>>>>>>>>>>>>>> attachments supported in the SP connector?  E.g. if I have a Word doc
>>>>>>>>>>>>>>>>>>> attached to a Task list item, would that be extracted?  So far, I see that
>>>>>>>>>>>>>>>>>>> library content gets crawled and I'm getting the list item data but am not
>>>>>>>>>>>>>>>>>>> sure what happens to the attachments.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <
>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the additional information.  It does appear
>>>>>>>>>>>>>>>>>>>> that the method that lists subsites is not working as expected under AWS.
>>>>>>>>>>>>>>>>>>>> Nor are a number of other methods which supposedly just list the
>>>>>>>>>>>>>>>>>>>> children of a subsite.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I've reopened CONNECTORS-772 to work on addressing this
>>>>>>>>>>>>>>>>>>>> issue.  Please stay tuned.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <
>>>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Most of the paths that get generated are listed in the
>>>>>>>>>>>>>>>>>>>>> attached log; they match what shows up in the diag report. So I'm not sure
>>>>>>>>>>>>>>>>>>>>> where they diverge; most of them just don't seem right.  There are 3
>>>>>>>>>>>>>>>>>>>>> subsites rooted in the main site: Abcd, Defghij, Klmnopqr.  It's strange
>>>>>>>>>>>>>>>>>>>>> that the connector would try such paths as:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /Klmnopqr/Defghij/Defghij/Announcements/// --
>>>>>>>>>>>>>>>>>>>>> there are multiple repetitions of the same subsite on the path, and to begin
>>>>>>>>>>>>>>>>>>>>> with, Defghij is not a subsite of Klmnopqr, so why would it try this? The
>>>>>>>>>>>>>>>>>>>>> /// at the end doesn't seem correct either, unless I'm missing something in
>>>>>>>>>>>>>>>>>>>>> how this pathing works.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /Test Library
>>>>>>>>>>>>>>>>>>>>> 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks wrong. A
>>>>>>>>>>>>>>>>>>>>> docname is mixed into the path, a subsite ends up after a docname?...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /Shared
>>>>>>>>>>>>>>>>>>>>> Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types of
>>>>>>>>>>>>>>>>>>>>> issues plus now somehow the docname got split with a forward slash?..
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> There are also a bunch of
>>>>>>>>>>>>>>>>>>>>> StringIndexOutOfBoundsExceptions.  Perhaps this logic doesn't fit with the
>>>>>>>>>>>>>>>>>>>>> pathing we're seeing on this amz-based installation?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'd expect the logic to just know that root contains 3
>>>>>>>>>>>>>>>>>>>>> subsites, and work off that. Each subsite has a specific list of libraries
>>>>>>>>>>>>>>>>>>>>> and lists, etc. It seems odd that the connector gets into this matching
>>>>>>>>>>>>>>>>>>>>> pattern, and tries what looks like thousands of variations (I aborted the
>>>>>>>>>>>>>>>>>>>>> execution).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> To clarify, the way you would need to analyze this is
>>>>>>>>>>>>>>>>>>>>>> to run a crawl with the wildcards as you have selected, abort if necessary
>>>>>>>>>>>>>>>>>>>>>> after a while, and then use the Document Status report to list the document
>>>>>>>>>>>>>>>>>>>>>> identifiers that had been generated.  Find a document identifier that you
>>>>>>>>>>>>>>>>>>>>>> believe represents a path that is illegal, and figure out what SOAP
>>>>>>>>>>>>>>>>>>>>>> getChild call caused the problem by returning incorrect data.  In other
>>>>>>>>>>>>>>>>>>>>>> words, find the point in the path where the path diverges from what exists
>>>>>>>>>>>>>>>>>>>>>> into what doesn't exist, and go back in the ManifoldCF logs to find the
>>>>>>>>>>>>>>>>>>>>>> particular SOAP request that led to the issue.
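>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As a hypothetical sketch of that "find the divergence
>>>>>>>>>>>>>>>>>>>>>> point" step (firstBogusComponent and the existing-paths
>>>>>>>>>>>>>>>>>>>>>> set are invented for the example):
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>  // Given the set of paths that really exist on the
>>>>>>>>>>>>>>>>>>>>>>  // server and a suspect document identifier, report the
>>>>>>>>>>>>>>>>>>>>>>  // first path prefix that does not exist.
>>>>>>>>>>>>>>>>>>>>>>  static String firstBogusComponent(
>>>>>>>>>>>>>>>>>>>>>>      java.util.Set<String> existingPaths, String docId) {
>>>>>>>>>>>>>>>>>>>>>>    StringBuilder prefix = new StringBuilder();
>>>>>>>>>>>>>>>>>>>>>>    for (String seg : docId.split("/")) {
>>>>>>>>>>>>>>>>>>>>>>      if (seg.length() == 0) continue;
>>>>>>>>>>>>>>>>>>>>>>      prefix.append('/').append(seg);
>>>>>>>>>>>>>>>>>>>>>>      if (!existingPaths.contains(prefix.toString()))
>>>>>>>>>>>>>>>>>>>>>>        return prefix.toString();
>>>>>>>>>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>>>>>>>>>    return null; // every prefix exists; the path is legal
>>>>>>>>>>>>>>>>>>>>>>  }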
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'd expect from your description that the problem
>>>>>>>>>>>>>>>>>>>>>> lies with getting child sites given a site path, but that's just a guess at
>>>>>>>>>>>>>>>>>>>>>> this point.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I don't understand what you mean by "I've tried the
>>>>>>>>>>>>>>>>>>>>>>> set of wildcards as below and I seem to be running into a lot of cycles,
>>>>>>>>>>>>>>>>>>>>>>> where various subsite folders are appended to each other and an extraction
>>>>>>>>>>>>>>>>>>>>>>> of data at all of those locations is attempted".   If you are seeing cycles
>>>>>>>>>>>>>>>>>>>>>>> it means that document discovery is still failing in some way.  For each
>>>>>>>>>>>>>>>>>>>>>>> folder/library/site/subsite, only the children of that
>>>>>>>>>>>>>>>>>>>>>>> folder/library/site/subsite should be appended to the path - ever.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If you can give a specific example, preferably
>>>>>>>>>>>>>>>>>>>>>>> including the soap back-and-forth, that would be very helpful.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <
>>>>>>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Quick question. Is there an easy way to configure
>>>>>>>>>>>>>>>>>>>>>>>> an SP repo connection for crawling of all content, from the root site all
>>>>>>>>>>>>>>>>>>>>>>>> the way down?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I've tried the set of wildcards as below and I seem
>>>>>>>>>>>>>>>>>>>>>>>> to be running into a lot of cycles, where various subsite folders are
>>>>>>>>>>>>>>>>>>>>>>>> appended to each other and an extraction of data at all of those locations
>>>>>>>>>>>>>>>>>>>>>>>> is attempted. Ideally I'd like to avoid having to construct an exact set of
>>>>>>>>>>>>>>>>>>>>>>>> paths because the set may change, especially with new content being added.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Path rules:
>>>>>>>>>>>>>>>>>>>>>>>> /* file include
>>>>>>>>>>>>>>>>>>>>>>>> /* library include
>>>>>>>>>>>>>>>>>>>>>>>> /* list include
>>>>>>>>>>>>>>>>>>>>>>>> /* site include
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Metadata:
>>>>>>>>>>>>>>>>>>>>>>>> /* include true
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'd also like to pull down any files attached to
>>>>>>>>>>>>>>>>>>>>>>>> list items. I'm hoping that some type of "/* file include" should do it,
>>>>>>>>>>>>>>>>>>>>>>>> once I figure out how to safely include all content.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
