Date: Wed, 18 Sep 2013 23:28:01 +1000
Subject: Re: Getting a 401 Unauthorized on a SharePoint 2010 crawl request, with MCPermissions.asmx installed
From: Will Parkinson
To: user@manifoldcf.apache.org

Yes, that's right; we're only really interested in the site that you are trying to crawl.

On Wed, Sep 18, 2013 at 11:25 PM, Dmitry Goldenberg wrote:

> Will,
>
> For SharePoint - 80, the output is
>
> NTAuthenticationProviders : (STRING) "NTLM"
>
> I assume we're not interested in the Default Web Site; for that, the output is simply "The parameter NTAuthenticationProviders is not set at this node."
>
> - Dmitry
>
> On Wed, Sep 18, 2013 at 9:16 AM, Will Parkinson wrote:
>
>> If you open IIS Manager and click on Sites, it is displayed in the ID column (see screenshot attached).
>>
>> On Wed, Sep 18, 2013 at 10:55 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>
>>> Hi Will,
>>> Sorry, what is the "sharepoint website *number*" in that invocation?
>>> - Dmitry
>>>
>>> On Wed, Sep 18, 2013 at 8:53 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
>>>
>>>> Hi Dmitry,
>>>>
>>>> Just out of interest, what does the following command output on your system?
>>>>
>>>> cd to C:\inetpub\adminscripts
>>>>
>>>> cscript adsutil.vbs get w3svc/<website number here>/root/NTAuthenticationProviders
>>>>
>>>> Cheers,
>>>>
>>>> Will
>>>>
>>>> On Wed, Sep 18, 2013 at 10:44 PM, Karl Wright wrote:
>>>>
>>>>> "This is the second time I'm encountering the issue which leads me to believe it's a quirk of IIS and/or SharePoint."
>>>>>
>>>>> It cannot be just a quirk of SharePoint, because SharePoint's UI etc. could not create or work with subsites if that were true. It may well be a configuration issue with IIS, which is indeed what I suspect. I have pinged all the resources I know of to try to get some insight into why this is happening.
>>>>>
>>>>> "Perhaps this is something that can be worked into the 'fabric' of ManifoldCF as a workaround for a known issue."
>>>>>
>>>>> Like I said before, this is a huge amount of work, tantamount to rewriting most of the connector. If this is what you want to request, that is your option, but there is no way we'd complete any of this work before December/January at the earliest.
>>>>>
>>>>> "Just to understand this a bit better, the main breakage here is that the wildcards don't work properly, right?"
>>>>>
>>>>> No, it means that ManifoldCF cannot get at any data of any kind associated with a SharePoint subsite. Accessing root data works fine. If you try to crawl as things are now, you must disable all subsites and just crawl the root site, or you will crawl the same things with longer and longer paths indefinitely.
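[Editor's note: Karl's point about crawling "the same things with longer and longer paths indefinitely" can be illustrated with a toy sketch. This is not ManifoldCF code; the class and method names are hypothetical. It assumes only what the thread reports: the subsite-listing call returns the root's children regardless of which parent path is passed in, so a breadth-first discovery loop generates ever-longer repeating paths until some cutoff stops it.]

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy illustration of the reported failure mode: a subsite-listing service
// that ignores the parent URL makes discovery loop forever.
public class BrokenDiscoveryDemo {
  // Stand-in for the broken GetSites call: whatever parent path you pass,
  // you always get back the root site's children.
  static List<String> getSites(String parentPath) {
    return Arrays.asList("Abcd", "Defghij", "Klmnopqr");
  }

  public static void main(String[] args) {
    Deque<String> queue = new ArrayDeque<>();
    queue.add("");
    int discovered = 0;
    // Without the cutoff, this loop never terminates.
    while (!queue.isEmpty() && discovered < 12) {
      String site = queue.removeFirst();
      for (String child : getSites(site)) {
        String path = site + "/" + child;
        System.out.println("Discovered site: " + path);
        queue.add(path);
        discovered++;
      }
    }
    // Prints /Abcd, /Defghij, /Klmnopqr, then /Abcd/Abcd, /Abcd/Defghij,
    // and so on: the same names repeat at ever-greater depths.
  }
}
```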
>>>>> Karl
>>>>>
>>>>> On Wed, Sep 18, 2013 at 8:38 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>
>>>>>> Karl,
>>>>>>
>>>>>> This is the second time I'm encountering the issue, which leads me to believe it's a quirk of IIS and/or SharePoint. Perhaps this is something that can be worked into the 'fabric' of ManifoldCF as a workaround for a known issue. I understand that it may have far-reaching tentacles, but I wonder if that's really the only option...
>>>>>>
>>>>>> Just to understand this a bit better, the main breakage here is that the wildcards don't work properly, right? In theory, if I have a repo connector config which lists specific library and list paths, things should work? It's only when the /* types of wildcards are included that we're in trouble?
>>>>>>
>>>>>> - Dmitry
>>>>>>
>>>>>> On Wed, Sep 18, 2013 at 8:07 AM, Karl Wright wrote:
>>>>>>
>>>>>>> Hi Dmitry,
>>>>>>>
>>>>>>> Someone else was having a similar problem. See http://social.technet.microsoft.com/Forums/sharepoint/en-US/e4b53c63-b89a-4356-a7b0-6ca7bfd22826/getting-sharepoint-subsite-from-custom-webservice.
>>>>>>>
>>>>>>> Apparently it does depend on how you get to the web service, which does argue that it is an IIS issue.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Sep 17, 2013 at 5:44 PM, Karl Wright wrote:
>>>>>>>
>>>>>>>> Hi Dmitry,
>>>>>>>>
>>>>>>>> As discussed privately, I had a look at your system. What is happening is that the C# static SPContext.Current.Web is not reflecting the subsite in any URL that contains a subsite. In other words, the URL coming in might be "http://servername/subsite1/_vti_bin/MCPermissions.asmx", but the MCPermissions.asmx plugin will think it is being executed in the root context ("http://servername"). That's pretty broken behavior, so I'm guessing that either IIS or SharePoint is somehow misconfigured to do this; if that were corrected, the web services would presumably begin to work right again. But I have no idea how this should actually be fixed.
>>>>>>>>
>>>>>>>> Will Parkinson, one of the subscribers of this list, may find the symptoms meaningful, since he has set up an AWS SharePoint instance before. I hope he will respond in a helpful way. Until then, I think we are stuck.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Tue, Sep 17, 2013 at 9:49 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> It looks like I'll be able to get you access to the test system we're using. Would you be interested in working with the system directly? I certainly don't mind doing some testing, but I thought we'd speed things up this way. If so, could you email me from a more private account so we can set this up?
>>>>>>>>> Thanks,
>>>>>>>>> - Dmitry
>>>>>>>>>
>>>>>>>>> On Tue, Sep 17, 2013 at 7:38 AM, Karl Wright wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>
>>>>>>>>>> Another interesting bit from the log:
>>>>>>>>>>
>>>>>>>>>> >>>>>>
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/lt/Forms/AllItems.aspx', 'List Template Gallery'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/masterpage/Forms/AllItems.aspx', 'Master Page Gallery'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Shared Documents/Forms/AllItems.aspx', 'Shared Documents'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/SiteAssets/Forms/AllItems.aspx', 'Site Assets'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/SitePages/Forms/AllPages.aspx', 'Site Pages'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/solutions/Forms/AllItems.aspx', 'Solution Gallery'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Style Library/Forms/AllItems.aspx', 'Style Library'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Test Library 1/Forms/AllItems.aspx', 'Test Library 1'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/theme/Forms/AllItems.aspx', 'Theme Gallery'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/wp/Forms/AllItems.aspx', 'Web Part Gallery'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents' exactly matched rule path '/*'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets' exactly matched rule path '/*'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages' exactly matched rule path '/*'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library' exactly matched rule path '/*'
>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>>>>>> <<<<<<
>>>>>>>>>>
>>>>>>>>>> This time it appears that it is the Lists service that is broken and does not recognize the parent site.
>>>>>>>>>>
>>>>>>>>>> I haven't corrected this problem yet, since now I am beginning to wonder if *any* of the web services under Amazon work at all for subsites. We may be better off implementing everything we need in the MCPermissions service. I will ponder this as I continue to research the logs.
>>>>>>>>>>
>>>>>>>>>> It's still valuable to check my getSites() implementation. I'll be doing another round of work tonight on the plugin.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 16, 2013 at 8:45 PM, Karl Wright wrote:
>>>>>>>>>>
>>>>>>>>>>> The augmented plugin can be downloaded from http://people.apache.org/~kwright/MetaCarta.SharePoint.MCPermissionsService.wsp. The revised connector code is also ready, and should be checked out and built from https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-772.
>>>>>>>>>>>
>>>>>>>>>>> Once you set it all up, you can see if it is doing the right thing by just trying to drill down through subsites in the UI. You should always see a list of subsites that is appropriate for the context you are in; if this does not happen, it is not working.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:45 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Karl,
>>>>>>>>>>>>
>>>>>>>>>>>> I can see how preloading the list of subsites may be less optimal. The advantage of doing it this way is that in one call you've got the structure in memory, which may be OK unless there are sites with a ton of subsites, which may stress memory. The disadvantage is having to throw this structure around.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I'll certainly help test out your changes; just let me know when they're available.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:19 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the code snippet. I'd prefer, though, not to preload the entire site structure in memory. Probably it would be better to just add another method to the ManifoldCF SharePoint 2010 plugin. More methods are going to be added anyway to support Claim Space Authentication, so I guess this would be just one more.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We honestly have never seen this problem before - so it's not just flakiness; it has something to do with the installation, I'm certain. At any rate, I'll get going right away on a workaround - if you are willing to test what I produce. I'm also certain there is at least one other issue, but hopefully that will become clearer once this one is resolved.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 6:49 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >> subsite discovery is effectively disabled except directly under the root site
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes.
Come to think of it, I once came across this problem
>>>>>>>>>>>>>> while implementing a SharePoint connector. I'm not sure whether it's exactly what's happening with the issue we're discussing, but it looks like it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I started off by using multiple getWebCollection calls to get the child subsites of sites and trying to navigate down that way. The problem was that getWebCollection was always returning the immediate subsites of the root site, no matter whether you're at the root or below, so I ended up generating infinite loops.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I switched over to using a single getAllSubWebCollection call and caching its results. That call returns the full list of all subsites as pairs of Title and Url. I had a POJO similar to the one below which held the list of sites and contained logic for enumerating the child sites, given the URL of a (parent) site. From what I recall, getWebCollection works inconsistently, either across SP versions or across installations, but the logic below should work in any case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *** public class SubSiteCollection -- holds a list of CrawledSite POJOs, each of which is a { title, url }.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *** SubSiteCollection has the following:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> public List<CrawledSite> getImmediateSubSites(String siteUrl) {
>>>>>>>>>>>>>>   List<CrawledSite> subSites = new ArrayList<CrawledSite>();
>>>>>>>>>>>>>>   for (CrawledSite site : sites) {
>>>>>>>>>>>>>>     if (isChildOf(siteUrl, site.getUrl().toString())) {
>>>>>>>>>>>>>>       subSites.add(site);
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>   return subSites;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> private static boolean isChildOf(String parentUrl, String urlToCheck) {
>>>>>>>>>>>>>>   final String parent = normalizeUrl(parentUrl);
>>>>>>>>>>>>>>   final String child = normalizeUrl(urlToCheck);
>>>>>>>>>>>>>>   boolean ret = false;
>>>>>>>>>>>>>>   if (child.startsWith(parent)) {
>>>>>>>>>>>>>>     String remainder = child.substring(parent.length());
>>>>>>>>>>>>>>     ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>   return ret;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> private static String normalizeUrl(String url) {
>>>>>>>>>>>>>>   return ((url.endsWith(SLASH)) ? url : url + SLASH).toLowerCase();
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Have a look at this sequence also:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd', 'Abcd'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij', 'Defghij'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr', 'Klmnopqr'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched rule path '/*'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly matched rule path '/*'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly matched rule path '/*'
>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is using the GetSites(String parent) method with a site name of "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back three sites (!!). The parent path is not correct, obviously, but nevertheless this is one way in which paths are getting completely messed up. It *looks* like the Webs web service is broken in such a way as to ignore the URL coming in, except for the base part, which means that subsite discovery is effectively disabled except directly under the root site.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This might still be OK if it is not possible to create subsites of subsites in this version of SharePoint. Can you confirm whether this is possible?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 2:42 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> "This is everything that got generated, from the very beginning"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well, something isn't right. What I expect to see, but don't, right up front are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - A webs "getWebCollection" invocation for /_vti_bin/webs.asmx
>>>>>>>>>>>>>>>> - Two lists "getListCollection" invocations for /_vti_bin/lists.asmx
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Instead, the first transactions I see are from already busted URLs - which makes no sense, since there would be no way they should have been able to get queued yet.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So there are a number of possibilities. First, maybe the log isn't getting cleared out, and the session in question therefore starts somewhere in the middle of manifoldcf.log.1. But no:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> C:\logs>grep "POST /_vti_bin/webs" manifoldcf.log.1
>>>>>>>>>>>>>>>> grep: input lines truncated - result questionable
>>>>>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nevertheless, there are some interesting points here.
>>>>>>>>>>>>>>>> First, note the following response, which I've been able to determine is against "Test Library 1":
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: getListItems xml response: '<... xmlns=""><... FileRef="SitePages/Home.aspx"/>'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: Checking whether to include document '/SitePages/Home.aspx'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: File '/SitePages/Home.aspx' exactly matched rule path '/*'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: Including file '/SitePages/Home.aspx'
>>>>>>>>>>>>>>>> WARN 2013-09-16 13:02:31,590 (Worker thread '23') - Sharepoint: Unexpected relPath structure; path is '/SitePages/Home.aspx', but expected length of 26
>>>>>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The FileRef in this case is pointing at what, exactly? Is there a SitePages/Home.aspx in the "Test Library 1" library? Or does it mean to refer back to the root site with this URL construction? And since this is supposedly at the root level, how does the combined site + library name come out to 26? I get 15, which leaves 11 characters unaccounted for.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm still looking at the logs to see if I can glean key information. Later, if I could set up a crawl against the SharePoint instance in question, that would certainly help. I can readily set up an ssh tunnel if that is what is required. But I won't be able to do it until I get home tonight.
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:58 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is everything that got generated, from the very beginning, meaning that I did a fresh build, new database, new connection definitions, start. The log must have rolled, but the .1 log is included.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If I were to get you access to the actual test system, would you mind taking a look? It may be more efficient than sending logs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> These logs are different but have exactly the same problem; they start in the middle, when the crawl is already well underway. I'm wondering if by chance you have more than one agents process running or something? Or maybe the log is rolling and stuff is getting lost? What's there is not what I would expect to see, at all.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I *did* manage to find two transactions that look like they might be helpful, but because the *results* of those transactions are required by transactions that take place minutes *before* in the log, I have no confidence that I'm looking at anything meaningful. But I'll get back to you on what I find nonetheless.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you decide to repeat this exercise, try watching the log with "tail -f" before starting the job. You should not see any log contents at all until the job is started.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Attached please find logs which start at the beginning. I started from a fresh build (clean db etc.), the logs start at server start, then I create the output connection and the repo connection, then the job, and then I fire off the job. I aborted the execution about a minute into it or so. That's all that's in the logs, with:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.connectors=DEBUG
>>>>>>>>>>>>>>>>>>> log4j.logger.httpclient.wire.header=DEBUG
>>>>>>>>>>>>>>>>>>> log4j.logger.org.apache.commons.httpclient=DEBUG
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are you sure these are the right logs?
>>>>>>>>>>>>>>>>>>>> - They start right in the middle of a crawl
>>>>>>>>>>>>>>>>>>>> - They are already in a broken state when they start, e.g. the kinds of things that are being looked up are already nonsense paths
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I need to see logs from the BEGINNING of a fresh crawl to see how the nonsense paths happen.
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I've generated logs with details as we discussed.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The job was created afresh, as before:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Path rules:
>>>>>>>>>>>>>>>>>>>>>   /* file include
>>>>>>>>>>>>>>>>>>>>>   /* library include
>>>>>>>>>>>>>>>>>>>>>   /* list include
>>>>>>>>>>>>>>>>>>>>>   /* site include
>>>>>>>>>>>>>>>>>>>>> Metadata:
>>>>>>>>>>>>>>>>>>>>>   /* include true
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The logs are attached.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "Do you think that this issue is generic with regard to any Amz instance?"
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I presume so, since you didn't apparently do anything special to set one of these up. Unfortunately, such instances are not part of the free tier, so I am still constrained from setting one up for myself because of household rules here.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "For now, I assume our only workaround is to list the paths of interest manually"
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Depending on what is going wrong, that may not even work. For this to happen, it looks like several SharePoint web service calls must be affected, and not in a cleanly predictable way.
>>>>>>>>>>>>>>>>>>>>>> "is identification and extraction of attachments supported in the SP connector?"
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ManifoldCF in general leaves identification and extraction to the search engine. Solr, for instance, uses Tika for this, if so configured. You can configure your Solr output connection to include or exclude specific MIME types or extensions if you want to limit what is attempted.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks, Karl. Do you think that this issue is generic with regard to any Amz instance? I'm just wondering how easily reproducible this may be.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For now, I assume our only workaround is to list the paths of interest manually, i.e. add explicit rules for each library and list.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> A related subject - is identification and extraction of attachments supported in the SP connector? E.g. if I have a Word doc attached to a Task list item, would that be extracted? So far, I see that library content gets crawled and I'm getting the list item data, but I am not sure what happens to the attachments.
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright < >>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the additional information. It does >>>>>>>>>>>>>>>>>>>>>>>> appear like the method that lists subsites is not working as expected under >>>>>>>>>>>>>>>>>>>>>>>> AWS. Nor are some number of other methods which supposedly just list the >>>>>>>>>>>>>>>>>>>>>>>> children of a subsite. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> I've reopened CONNECTORS-772 to work on addressing >>>>>>>>>>>>>>>>>>>>>>>> this issue. Please stay tuned. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg >>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Karl, >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Most of the paths that get generated are listed in >>>>>>>>>>>>>>>>>>>>>>>>> the attached log, they match what shows up in the diag report. So I'm not >>>>>>>>>>>>>>>>>>>>>>>>> sure where they diverge, most of them just don't seem right. There are 3 >>>>>>>>>>>>>>>>>>>>>>>>> subsites rooted in the main site: Abcd, Defghij, Klmnopqr. It's strange >>>>>>>>>>>>>>>>>>>>>>>>> that the connector would try such paths as: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// >>>>>>>>>>>>>>>>>>>>>>>>> -- there are multiple repetitions of the same subsite on the path and to >>>>>>>>>>>>>>>>>>>>>>>>> begin with, Defghij is not a subsite of Klmnopqr, so why would it try >>>>>>>>>>>>>>>>>>>>>>>>> this? the /// at the end doesn't seem correct either, unless I'm missing >>>>>>>>>>>>>>>>>>>>>>>>> something in how this pathing works. 
>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> /Test Library >>>>>>>>>>>>>>>>>>>>>>>>> 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks wrong. A >>>>>>>>>>>>>>>>>>>>>>>>> docname is mixed into the path, a subsite ends up after a docname?... >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> /Shared >>>>>>>>>>>>>>>>>>>>>>>>> Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types of >>>>>>>>>>>>>>>>>>>>>>>>> issues plus now somehow the docname got split with a forward slash?.. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> There are also a bunch of >>>>>>>>>>>>>>>>>>>>>>>>> StringIndexOutOfBoundsException's. Perhaps this logic doesn't fit with the >>>>>>>>>>>>>>>>>>>>>>>>> pathing we're seeing on this amz-based installation? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> I'd expect the logic to just know that root >>>>>>>>>>>>>>>>>>>>>>>>> contains 3 subsites, and work off that. Each subsite has a specific list of >>>>>>>>>>>>>>>>>>>>>>>>> libraries and lists, etc. It seems odd that the connector gets into this >>>>>>>>>>>>>>>>>>>>>>>>> matching pattern, and tries what looks like thousands of variations (I >>>>>>>>>>>>>>>>>>>>>>>>> aborted the execution). >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright < >>>>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> To clarify, the way you would need to analyze >>>>>>>>>>>>>>>>>>>>>>>>>> this is to run a crawl with the wildcards as you have selected, abort if >>>>>>>>>>>>>>>>>>>>>>>>>> necessary after a while, and then use the Document Status report to list >>>>>>>>>>>>>>>>>>>>>>>>>> the document identifiers that had been generated. 
Find a document >>>>>>>>>>>>>>>>>>>>>>>>>> identifier that you believe represents a path that is illegal, and figure >>>>>>>>>>>>>>>>>>>>>>>>>> out what SOAP getChild call caused the problem by returning incorrect >>>>>>>>>>>>>>>>>>>>>>>>>> data. In other words, find the point in the path where the path diverges >>>>>>>>>>>>>>>>>>>>>>>>>> from what exists into what doesn't exist, and go back in the ManifoldCF >>>>>>>>>>>>>>>>>>>>>>>>>> logs to find the particular SOAP request that led to the issue. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> I'd expect from your description that the problem >>>>>>>>>>>>>>>>>>>>>>>>>> lies with getting child sites given a site path, but that's just a guess at >>>>>>>>>>>>>>>>>>>>>>>>>> this point. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright < >>>>>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> I don't understand what you mean by "I've tried >>>>>>>>>>>>>>>>>>>>>>>>>>> the set of wildcards as below and I seem to be running into a lot of >>>>>>>>>>>>>>>>>>>>>>>>>>> cycles, where various subsite folders are appended to each other and an >>>>>>>>>>>>>>>>>>>>>>>>>>> extraction of data at all of those locations is attempted". If you are >>>>>>>>>>>>>>>>>>>>>>>>>>> seeing cycles it means that document discovery is still failing in some >>>>>>>>>>>>>>>>>>>>>>>>>>> way. For each folder/library/site/subsite, only the children of that >>>>>>>>>>>>>>>>>>>>>>>>>>> folder/library/site/subsite should be appended to the path - ever. >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> If you can give a specific example, preferably >>>>>>>>>>>>>>>>>>>>>>>>>>> including the soap back-and-forth, that would be very helpful. 
Karl

On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg wrote:

Hi Karl,

Quick question. Is there an easy way to configure an SP repo connection for crawling of all content, from the root site all the way down?

I've tried the set of wildcards as below and I seem to be running into a lot of cycles, where various subsite folders are appended to each other and an extraction of data at all of those locations is attempted. Ideally I'd like to avoid having to construct an exact set of paths because the set may change, especially with new content being added.

Path rules:
/* file include
/* library include
/* list include
/* site include

Metadata:
/* include true

I'd also like to pull down any files attached to list items. I'm hoping that some type of "/* file include" should do it, once I figure out how to safely include all content.

Thanks,
- Dmitry
Yes that's right, only really interested in the site that you are trying to crawl


On Wed, Sep 18, 2013 at 11:25 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Will,

For SharePoint - 80, the output is

NTAuthenticationProviders       : (STRING) "NTLM"

I assume we're not interested in the Default Web Site; for that, the output is simply "The parameter NTAuthenticationProviders is not set at this node."

- Dmitry


On Wed, Sep 18, 2013 at 9:16 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
If you open IIS manager and click on sites, it is displayed in the ID column (see screenshot attached)


On Wed, Sep 18, 2013 at 10:55 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Will,
Sorry, what is the "sharepoint website number" in that invocation?
- Dmitry


On Wed, Sep 18, 2013 at 8:53 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
Hi Dmitry

Just out of interest, what does the following command output on your system?
cd to C:\inetpub\adminscripts

cscript adsutil.vbs get w3svc/<put your sharepoint website number here>/root/NTAuthenticationProviders

Cheers,

Will


On Wed, Sep 18, 2013 at 10:44 PM, Karl Wright <daddywri@gmail.com> wrote:
"This is the second time I'm encountering the issue which leads me to believe it's a quirk of IIS and/or SharePoint."

It cannot be just a quirk of SharePoint because SharePoint's UI etc could not create or work with subsites if that was true. It may well be a configuration issue with IIS, which is indeed what I suspect. I have pinged all the resources I know of to try and get some insight as to why this is happening.

"Perhaps this is something that can be worked into the 'fabric' of ManifoldCF as a workaround for a known issue."

Like I said before, this is a huge amount of work, tantamount to rewriting most of the connector. If this is what you want to request, that is your option, but there is no way we'd complete any of this work before December/January at the earliest.

"Just to understand this a bit better, the main breakage here is that the wildcards don't work properly, right?"

No, it means that ManifoldCF cannot get at any data of any kind associated with a SharePoint subsite. Accessing root data works fine. If you try to crawl as things are now, you must disable all subsites and just crawl the root site, or you will crawl the same things with longer and longer paths indefinitely.

Karl





On Wed, Sep 18, 2013 at 8:38 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

This is the second time I'm encountering the issue, which leads me to believe it's a quirk of IIS and/or SharePoint. Perhaps this is something that can be worked into the 'fabric' of ManifoldCF as a workaround for a known issue. I understand that it may have far-reaching tentacles but I wonder if that's really the only option...

Just to understand this a bit better, the main breakage here is that the wildcards don't work properly, right? In theory, if I have a repo connector config which lists specific library and list paths, things should work? It's only when the /* types of wildcards are included that we're in trouble?
- Dmitry


On Wed, Sep 18, 2013 at 8:07 AM, Karl Wright <daddywri@gmail.com> wrote:
Apparently it does depend on how you get to the web service, which does argue that it is an IIS issue.

Karl



On Tue, Sep 17, 2013 at 5:44 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

As discussed privately I had a look at your system. What is happening is that the C# static SPContext.Current.Web is not reflecting the subsite in any URL that contains a subsite. In other words, the URL coming in might be "http://servername/subsite1/_vti_bin/MCPermissions.asmx", but the MCPermissions.asmx plugin will think it is being executed in the root context ("http://servername"). That's pretty broken behavior, so I'm guessing that the problem is that either IIS or SharePoint is somehow misconfigured to do this; if that were corrected, the web services would then begin to work right again. But I have no idea how this should actually be fixed.

Will Parkinson, one of the subscribers of this list, may find the symptoms meaningful, since he set up an AWS SharePoint instance before. I hope he will respond in a helpful way. Until then, I think we are stuck.

Thanks,
Karl



On Tue, Sep 17, 2013 at 9:49 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Karl,

It looks like I'll be able to get access for you to the test system we're using. Would you be interested in working with the system directly? I certainly don't mind doing some testing but I thought we'd speed things up this way. If so, could you email me from a more private account so we can set this up?

Thanks,
- Dmitry


On Tue, Sep 17, 2013 at 7:38 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Another interesting bit from the log:

>>>>>>
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/lt/Forms/AllItems.aspx', 'List Template Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/masterpage/Forms/AllItems.aspx', 'Master Page Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Shared Documents/Forms/AllItems.aspx', 'Shared Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/SiteAssets/Forms/AllItems.aspx', 'Site Assets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/SitePages/Forms/AllPages.aspx', 'Site Pages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/solutions/Forms/AllItems.aspx', 'Solution Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Style Library/Forms/AllItems.aspx', 'Style Library'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Test Library 1/Forms/AllItems.aspx', 'Test Library 1'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/theme/Forms/AllItems.aspx', 'Theme Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/wp/Forms/AllItems.aspx', 'Web Part Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
<<<<<<

This time it appears that it is the Lists service that is broken and does not recognize the parent site.

I haven't corrected this problem yet since now I am beginning to wonder if *any* of the web services under Amazon work at all for subsites. We may be better off implementing everything we need in the MCPermissions service. I will ponder this as I continue to research the logs.

It's still valuable to check my getSites() implementation. I'll be doing another round of work tonight on the plugin.
Karl


On Mon, Sep 16, 2013 at 8:45 PM, Karl Wright <daddywri@gmail.com> wrote:
The augmented plugin can be downloaded from http://people.apache.org/~kwright/MetaCarta.SharePoint.MCPermissionsService.wsp . The revised connector code is also ready, and should be checked out and built from https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-772 .

Once you set it all up, you can see if it is doing the right thing by just trying to drill down through subsites in the UI. You should always see a list of subsites that is appropriate for the context you are in; if this does not happen it is not working.

Thanks,
Karl


On Mon, Sep 16, 2013 at 7:45 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

I can see how preloading the list of subsites may be less optimal.. The advantage of doing it this way is one call and you've got the structure in memory, which may be OK unless there are sites with a ton of subsites which may stress out memory. The disadvantage is having to throw this structure around..

Yes, I'll certainly help test out your changes, just let me know when they're available.
Thanks,
- Dmitry


On Mon, Sep 16, 2013 at 7:19 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Thanks for the code snippet. I'd prefer, though, to not preload the entire site structure in memory. Probably it would be better to just add another method to the ManifoldCF SharePoint 2010 plugin. More methods are going to be added anyway to support Claim Space Authentication, so I guess this would be just one more.

We honestly have never seen this problem before - so it's not just flakiness, it has something to do with the installation, I'm certain. At any rate, I'll get going right away on a workaround - if you are willing to test what I produce. I'm also certain there is at least one other issue, but hopefully that will become clearer once this one is resolved.

Thanks,
Karl




On Mon, Sep 16, 2013 at 6:49 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

>> subsite discovery is effectively disabled except directly under the root site

Yes. Come to think of it, I once came across this problem while implementing a SharePoint connector. I'm not sure whether it's exactly what's happening with the issue we're discussing but looks like it.

I started off by using multiple getWebCollection calls to get child subsites of sites and trying to navigate down that way. The problem was that getWebCollection was always returning the immediate subsites of the root site no matter whether you're at the root or below, so I ended up generating infinite loops.

I switched over to using a single getAllSubWebCollection call and caching its results. That call returns the full list of all subsites as pairs of Title and Url. I had a POJO similar to the one below which held the list of sites and contained logic for enumerating the child sites, given the URL of a (parent) site. From what I recall, getWebCollection works inconsistently, either across SP versions or across installations, but the logic below should work in any case.

*** public class SubSiteCollection -- holds a list of CrawledSite pojo's, each of which is a { title, url }.

*** SubSiteCollection has the following:
 public List<CrawledSite> getImmediateSubSites(String siteUrl) {
   List<CrawledSite> subSites = new ArrayList<CrawledSite>();
   for (CrawledSite site : sites) {
     if (isChildOf(siteUrl, site.getUrl().toString())) {
       subSites.add(site);
     }
   }
   return subSites;
 }

 private static boolean isChildOf(String parentUrl, String urlToCheck) {
   final String parent = normalizeUrl(parentUrl);
   final String child = normalizeUrl(urlToCheck);
   boolean ret = false;
   if (child.startsWith(parent)) {
     String remainder = child.substring(parent.length());
     ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
   }
   return ret;
 }

 private static String normalizeUrl(String url) {
   return ((url.endsWith(SLASH)) ? url : url + SLASH).toLowerCase();
 }
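[Editor's note] Dmitry's immediate-child check can be exercised as a standalone sketch. The class name and site URLs below are hypothetical, and Spring's StringUtils.countOccurrencesOf is replaced by a plain slash count so the snippet compiles with no external dependencies:

```java
// Standalone sketch of the immediate-child test; SubSiteCheck and the
// example URLs are illustrative only, not connector code.
public class SubSiteCheck {
    private static final String SLASH = "/";

    // Normalize so every URL ends with a slash and compares case-insensitively.
    static String normalizeUrl(String url) {
        return (url.endsWith(SLASH) ? url : url + SLASH).toLowerCase();
    }

    // True when urlToCheck extends parentUrl by exactly one path segment.
    static boolean isChildOf(String parentUrl, String urlToCheck) {
        String parent = normalizeUrl(parentUrl);
        String child = normalizeUrl(urlToCheck);
        if (!child.startsWith(parent)) {
            return false;
        }
        String remainder = child.substring(parent.length());
        // Exactly one trailing slash left means exactly one extra segment.
        return remainder.chars().filter(c -> c == '/').count() == 1;
    }

    public static void main(String[] args) {
        System.out.println(isChildOf("http://server", "http://server/Abcd"));           // true
        System.out.println(isChildOf("http://server", "http://server/Abcd/Sub1"));      // false
        System.out.println(isChildOf("http://server/Abcd", "http://server/abcd/Sub1")); // true
    }
}
```

The normalize-then-count approach is what makes the check return only immediate children: a grandchild URL leaves more than one slash in the remainder and is rejected.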

- Dmitry


On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Have a look at this sequence also:

>>>>>>
DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd', 'Abcd'
DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij', 'Defghij'
DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr', 'Klmnopqr'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'

<<<<<<

This is using the GetSites(String parent) method with a site name of "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back three sites (!!). The parent path is not correct, obviously, but nevertheless this is one way in which paths are getting completely messed up. It *looks* like the Webs web service is broken in such a way as to ignore the URL coming in, except for the base part, which means that subsite discovery is effectively disabled except directly under the root site.

This might still be OK if it is not possible to create subsites of subsites in this version of SharePoint. Can you confirm that this is or is not possible?

Karl
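[Editor's note] As a purely illustrative stopgap (not part of the ManifoldCF connector), one could flag any discovered subsite whose name already appears as a segment of the current path, the repetition pattern visible in bogus paths like /Klmnopqr/Abcd/Abcd/Klmnopqr above. Legitimate site trees can reuse a name at different depths, so this is only a diagnostic heuristic:

```java
// Illustrative guard, not connector code: detect a discovered subsite whose
// name already occurs as a segment of the current path.
public class PathGuard {
    static boolean wouldRepeatSegment(String currentPath, String childName) {
        // split("/") yields a leading empty segment for absolute paths,
        // which never matches a real site name.
        for (String segment : currentPath.split("/")) {
            if (segment.equalsIgnoreCase(childName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(wouldRepeatSegment("/Klmnopqr/Abcd/Abcd/Klmnopqr", "Abcd")); // true
        System.out.println(wouldRepeatSegment("/Klmnopqr", "Defghij"));                 // false
    }
}
```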



On Mon, Sep 16, 2013 at 2:42 PM, Karl Wright <daddywri@gmail.com> wrote:
"This is everything that got generated, from the very beginning"

Well, something isn't right. What I expect to see right up front, but don't, are:

- A webs "getWebCollection" invocation for /_vti_bin/webs.asmx
- Two lists "getListCollection" invocations for /_vti_bin/lists.asmx

Instead the first transactions I see are from already busted URLs - which make no sense since there would be no way they should have been able to get queued yet.

So there are a number of possibilities. First, maybe the log isn't getting cleared out, and the session in question therefore starts somewhere in the middle of manifoldcf.log.1. But no:

>>>>>>
C:\logs>grep "POST /_vti_bin/webs" manifoldcf.log.1
grep: input lines truncated - result questionable
<<<<<<

Nevertheless there are some interesting points here. First, note the following response, which I've been able to determine is against "Test Library 1":

>>>>>>
DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: getListItems xml response: '<GetListItems xmlns="http://schemas.microsoft.com/sharepoint/soap/directory/"><GetListItemsResponse xmlns=""><GetListItemsResult FileRef="SitePages/Home.aspx"/></GetListItemsResponse></GetListItems>'
DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: Checking whether to include document '/SitePages/Home.aspx'
DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: File '/SitePages/Home.aspx' exactly matched rule path '/*'
DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: Including file '/SitePages/Home.aspx'
 WARN 2013-09-16 13:02:31,590 (Worker thread '23') - Sharepoint: Unexpected relPath structure; path is '/SitePages/Home.aspx', but expected <list/library> length of 26
<<<<<<

The FileRef in this case is pointing at what, exactly? Is there a SitePages/Home.aspx in the "Test Library 1" library? Or does it mean to refer back to the root site with this URL construction? And since this is supposedly at the root level, how come the combined site + library name comes out to 26?? I get 15, which leaves 11 characters unaccounted for.

I'm still looking at the logs to see if I can glean key information. Later, if I could set up a crawl against the sharepoint instance in question, that would certainly help. I can readily set up an ssh tunnel if that is what is required. But I won't be able to do it until I get home tonight.

Karl



On Mon, Sep 16, 2013 at 1:58 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

This is everything that got generated, from the very beginning, meaning that I did a fresh build, new database, new connection definitions, start. The log must have rolled but the .1 log is included.

If I were to get you access to the actual test system, would you mind taking a look? It may be more efficient than sending logs..
- Dmitry


On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <daddywri@gmail.com> wrote:
These logs are different but have exactly the same problem; they start in the middle when the crawl is already well underway. I'm wondering if by chance you have more than one agents process running or something? Or maybe the log is rolling and stuff is getting lost? What's there is not what I would expect to see, at all.

I *did* manage to find two transactions that look like they might be helpful, but because the *results* of those transactions are required by transactions that take place minutes *before* in the log, I have no confidence that I'm looking at anything meaningful. But I'll get back to you on what I find nonetheless.

If you decide to repeat this exercise, try watching the log with "tail -f" before starting the job. You should not see any log contents at all until the job is started.

Karl


On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

Attached please find logs which start at the beginning. I started from a fresh build (clean db etc.), the logs start at server start, then I create the output connection and the repo connection, then the job, and then I fire off the job. I aborted the execution about a minute into it or so. That's all that's in the logs with:

org.apache.manifoldcf.connectors=DEBUG

log4j.logger.httpclient.wire.header=DEBUG
log4j.logger.org.apache.commons.httpclient=DEBUG

- Dmitry



On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Are you sure these are the right logs?
- They start right in the middle of a crawl
- They are already in a broken state when they start, e.g. the kinds of things that are being looked up are already nonsense paths

I need to see logs from the BEGINNING of a fresh crawl to see how the nonsense paths happen.

Thanks,
Karl




On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

I've generated logs with details as we discussed.

The job was created afresh, as before:
Path rules:
/* file include
/* library include
/* list include
/* site include
Metadata:
/* include true
The logs are attached.
- Dmitry

On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:
"Do you think that this issue is generic with regard to any Amz instance?"

I presume so, since you didn't apparently do anything special to set one of these up. Unfortunately, such instances are not part of the free tier, so I am still constrained from setting one up for myself because of household rules here.

"For now, I assume our only workaround is to list the paths of interest manually"

Depending on what is going wrong, that may not even work. It looks like several SharePoint web service calls may be affected, and not in a cleanly predictable way, for this to happen.

"is identification and extraction of attachments supported in the SP connector?"

ManifoldCF in general leaves identification and extraction to the search engine. Solr, for instance, uses Tika for this, if so configured. You can configure your Solr output connection to include or exclude specific mime types or extensions if you want to limit what is attempted.

Karl



On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Thanks, Karl. Do you think that this issue is generic with regard to any Amz instance? I'm just wondering how easily reproducible this may be..

For now, I assume our only workaround is to list the paths of interest manually, i.e. add explicit rules for each library and list.

A related subject - is identification and extraction of attachments supported in the SP connector? E.g. if I have a Word doc attached to a Task list item, would that be extracted? So far, I see that library content gets crawled and I'm getting the list item data but am not sure what happens to the attachments.


On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Thanks for the additional information. It does appear like the method that lists subsites is not working as expected under AWS. Nor are some number of other methods which supposedly just list the children of a subsite.

I've reopened CONNECTORS-772 to work on addressing this issue. Please stay tuned.

Karl



On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Karl,

Most of the paths that get generated are listed in the attached log, they match what shows up in the diag report. So I'm not sure where they diverge, most of them just don't seem right. There are 3 subsites rooted in the main site: Abcd, Defghij, Klmnopqr. It's strange that the connector would try such paths as:

/Klmnopqr/Defghij/Defghij/Announcements/// -- there are multiple repetitions of the same subsite on the path and to begin with, Defghij is not a subsite of Klmnopqr, so why would it try this? The /// at the end doesn't seem correct either, unless I'm missing something in how this pathing works.

/Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks wrong. A docname is mixed into the path, a subsite ends up after a docname?...

/Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types of issues plus now somehow the docname got split with a forward slash?..

There are also a bunch of StringIndexOutOfBoundsException's. Perhaps this logic doesn't fit with the pathing we're seeing on this amz-based installation?

I'd expect the logic to just know that root contains 3 subsites, and work off that. Each subsite has a specific list of libraries and lists, etc. It seems odd that the connector gets into this matching pattern, and tries what looks like thousands of variations (I aborted the execution).

- Dmitry



On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

To clarify, the way you would need to analyze this is to run a crawl with the wildcards as you have selected, abort if necessary after a while, and then use the Document Status report to list the document identifiers that had been generated. Find a document identifier that you believe represents a path that is illegal, and figure out what SOAP getChild call caused the problem by returning incorrect data. In other words, find the point in the path where the path diverges from what exists into what doesn't exist, and go back in the ManifoldCF logs to find the particular SOAP request that led to the issue.
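The "find where the path diverges" step can be sketched mechanically: walk the document identifier segment by segment and report the first prefix that does not exist. This is an illustrative helper, not part of ManifoldCF; the sample site tree is hypothetical, loosely based on the three-subsite layout described earlier in the thread:

```python
def first_divergence(doc_id, existing_paths):
    """Return the first path prefix of doc_id that is not in existing_paths,
    or None if every prefix exists (i.e. the path looks plausible)."""
    prefix = ""
    for segment in filter(None, doc_id.split("/")):
        prefix += "/" + segment
        if prefix not in existing_paths:
            return prefix
    return None

# Hypothetical site tree: three subsites under the root.
existing_paths = {"/Abcd", "/Defghij", "/Klmnopqr", "/Defghij/Announcements"}

print(first_divergence("/Klmnopqr/Defghij/Defghij/Announcements", existing_paths))
# -> /Klmnopqr/Defghij : the SOAP response to inspect is the one that
#    reported "Defghij" as a child of /Klmnopqr
```

The returned prefix tells you which getChild-style call to look for in the ManifoldCF logs.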

I'd expect from your description that the problem lies with getting child sites given a site path, but that's just a guess at this point.
Karl



On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

I don't understand what you mean by "I've tried the set of wildcards as below and I seem to be running into a lot of cycles, where various subsite folders are appended to each other and an extraction of data at all of those locations is attempted". If you are seeing cycles, it means that document discovery is still failing in some way. For each folder/library/site/subsite, only the children of that folder/library/site/subsite should be appended to the path - ever.

If you can give a specific example, preferably including the SOAP back-and-forth, that would be very helpful.
Karl



On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Karl,

Quick question. Is there an easy way to configure an SP repo connection for crawling of all content, from the root site all the way down?

I've tried the set of wildcards as below and I seem to be running into a lot of cycles, where various subsite folders are appended to each other and an extraction of data at all of those locations is attempted. Ideally I'd like to avoid having to construct an exact set of paths because the set may change, especially with new content being added.

Path rules:
/* file include
/* library include
/* list include
/* site include

Metadata:
/* include true

I'd also like to pull down any files attached to list items. I'm hoping that some type of "/* file include" should do it, once I figure out how to safely include all content.

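For intuition, the all-inclusive wildcard rules quoted above can be modeled with glob-style matching. A rough sketch, assuming "*" matches any sequence of characters (an assumption about the rule syntax for illustration, not the connector's documented semantics):

```python
import fnmatch

# Rule tuples: (pattern, entity kind, action). These mirror the rules
# quoted in the message above; the matching behavior here is assumed,
# not ManifoldCF's own code.
rules = [
    ("/*", "site", "include"),
    ("/*", "library", "include"),
    ("/*", "list", "include"),
    ("/*", "file", "include"),
]

def included(path, kind):
    """Apply the first rule whose kind and pattern match; default to exclude."""
    for pattern, rule_kind, action in rules:
        if rule_kind == kind and fnmatch.fnmatchcase(path, pattern):
            return action == "include"
    return False

print(included("/Abcd/Announcements", "list"))                 # True
print(included("/Abcd/Shared Documents/report.docx", "file"))  # True
```

Under this reading, "/*" at every level includes everything discovered, which is why correct child discovery matters so much: the rules themselves never prune bogus paths.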
Thanks,
- Dmitry

