Subject: Re: Getting a 401 Unauthorized on a SharePoint 2010 crawl request, with MCPermissions.asmx installed
From: Karl Wright
To: "user@manifoldcf.apache.org"
Date: Wed, 18 Sep 2013 16:31:22 -0400

Tried a crawl here, with the following rules:

site: "/"
library: "/*"
file: "/*"

Crawled 10 documents properly and completed, indexing 4 actual files. I'm going to try lists, and if that works, merge the contents of CONNECTORS-772 branch into trunk.

Karl

On Wed, Sep 18, 2013 at 2:56 PM, Karl Wright wrote:

> I forgot to mention: I removed the "4.0 AWS" selection. Select just plain 4.0 instead.
>
> Karl
>
> On Wed, Sep 18, 2013 at 2:06 PM, Karl Wright wrote:
>
>> Thanks.
>>
>> I committed a better fix. You will need a clean job again though if you want to try it.
>>
>> Karl
>>
>> On Wed, Sep 18, 2013 at 1:30 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>
>>> Karl,
>>>
>>> Attaching the full log.
>>>
>>> - Dmitry
>>>
>>> On Wed, Sep 18, 2013 at 1:15 PM, Karl Wright wrote:
>>>
>>>> Ok - is there a "Checking whether to include library" message in the log? If so, can you send that to me?
>>>>
>>>> Karl
>>>>
>>>> On Wed, Sep 18, 2013 at 1:02 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I'm definitely seeing this issue, after a full 'rejig' of the system: svn up, ant clean (actually blew away dist/example), ant build, re-created the connectors and job.
>>>>> Still seeing those string index out of bounds exceptions.
>>>>>
>>>>> - Dmitry
>>>>>
>>>>> On Wed, Sep 18, 2013 at 12:15 PM, Karl Wright wrote:
>>>>>
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> I think this is the same bug I fixed earlier today. I think you just have a job around from before the code change that fixed it. If you can create a new job and run that, see if you get the same issue.
>>>>>>
>>>>>> I'll be able to explore this more thoroughly when I get home tonight; from here I cannot see your instance due to firewall.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Sep 18, 2013 at 12:01 PM, Karl Wright wrote:
>>>>>>
>>>>>>> Not a regression; a bug I introduced. Let me look at it - should be fixable shortly.
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Sep 18, 2013 at 11:48 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> I've just re-tested using the latest. I wonder if there's a regression issue.
>>>>>>>> Just crawling /Shared Documents of the root site, I'm running into what seems like an indefinite loop of retrying to crawl that directory, with the following error showing up time after time:
>>>>>>>>
>>>>>>>> DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: Getting version of '//Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: Checking whether to include document '/Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: File '/Shared Documents/test-word-doc-1.docx' exactly matched rule path '/Shared Documents/*'
>>>>>>>> DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: Including file '/Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: Finding metadata to include for document/item '/Shared Documents/test-word-doc-1.docx'.
>>>>>>>> FATAL 2013-09-18 11:42:25,004 (Worker thread '0') - Error tossed: String index out of range: -1
>>>>>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>>>>>>     at java.lang.String.substring(String.java:1911)
>>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.getDocumentVersions(SharePointRepository.java:926)
>>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)
>>>>>>>>
>>>>>>>> DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: Getting version of '//Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: Checking whether to include document '/Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: File '/Shared Documents/test-word-doc-1.docx' exactly matched rule path '/Shared Documents/*'
>>>>>>>> DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: Including file '/Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: Finding metadata to include for document/item '/Shared Documents/test-word-doc-1.docx'.
>>>>>>>> FATAL 2013-09-18 11:42:26,840 (Worker thread '2') - Error tossed: String index out of range: -1
>>>>>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>>>>>>     at java.lang.String.substring(String.java:1911)
>>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.getDocumentVersions(SharePointRepository.java:926)
>>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)
>>>>>>>>
>>>>>>>> DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: Getting version of '//Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: Checking whether to include document '/Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: File '/Shared Documents/test-word-doc-1.docx' exactly matched rule path '/Shared Documents/*'
>>>>>>>> DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: Including file '/Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: Finding metadata to include for document/item '/Shared Documents/test-word-doc-1.docx'.
>>>>>>>> FATAL 2013-09-18 11:42:26,865 (Worker thread '1') - Error tossed: String index out of range: -1
>>>>>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>>>>>>     at java.lang.String.substring(String.java:1911)
>>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.getDocumentVersions(SharePointRepository.java:926)
>>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)
>>>>>>>>
>>>>>>>> DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: Getting version of '//Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: Checking whether to include document '/Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: File '/Shared Documents/test-word-doc-1.docx' exactly matched rule path '/Shared Documents/*'
>>>>>>>> DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: Including file '/Shared Documents/test-word-doc-1.docx'
>>>>>>>> DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: Finding metadata to include for document/item '/Shared Documents/test-word-doc-1.docx'.
>>>>>>>> FATAL 2013-09-18 11:42:26,895 (Worker thread '3') - Error tossed: String index out of range: -1
>>>>>>>>
>>>>>>>> On Wed, Sep 18, 2013 at 11:27 AM, Karl Wright wrote:
>>>>>>>>
>>>>>>>>> Hi Dmitry,
>>>>>>>>>
>>>>>>>>> It may be worth reviewing with that engineer what steps he took when he installed the instance. If he used the standard installer, IIRC there are a number of ways you can mess this up - the primary way being if you try to install IIS afterwards and then just try to patch things up.
>>>>>>>>> The canned install usually does best if IIS is installed first.
>>>>>>>>>
>>>>>>>>> At any rate, I think that you have a probable case of "operator error" here...
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> I can think of a few possibilities.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 18, 2013 at 11:16 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>
>>>>>>>>>> SharePoint was not installed by a domain user (the Windows instance is not on a domain).
>>>>>>>>>>
>>>>>>>>>> This is not a canned AWS SharePoint installation; an engineer on the team installed it, using the standard installer program, I believe.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 18, 2013 at 10:34 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dmitry, do you know if SharePoint was installed by a domain user? I have heard of issues with SharePoint if not installed using a domain user (e.g. DOMAIN\someuser)
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 19, 2013 at 12:31 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> No, I didn't have that issue. The issue I had was the // and /// references being added in the wrong places in the page URLs.
>>>>>>>>>>>>
>>>>>>>>>>>> I was getting things like
>>>>>>>>>>>>
>>>>>>>>>>>> /Site Name/Lib///rary/test.aspx
>>>>>>>>>>>>
>>>>>>>>>>>> My first setup was an out-of-the-box setup; the main site was on port 80, using classic authentication. With the path modification in the mcf-sharepoint-connector.jar, it worked very well.
>>>>>>>>>>>>
>>>>>>>>>>>> I set up Active Directory on that same server to authenticate via NTLM.
>>>>>>>>>>>>
>>>>>>>>>>>> The second server had the site on https on port 443, had claims-based authentication using ADFS and Kerberos.
>>>>>>>>>>>> I had to modify the mcf-sharepoint-connector.jar and MCPermissions.wsp to get this to work around the lack of SIDs returned from the permissions web service.
>>>>>>>>>>>>
>>>>>>>>>>>> In this case, Active Directory and ADFS were set up on separate AWS servers.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 19, 2013 at 12:23 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Will,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The path stuff we're already dealing with - see the CONNECTORS-772 branch. But what we are having trouble with is something much more fundamental. On Dmitry's AWS instance, when you talk to the web services for a root site, it works fine. But as soon as you add a subsite path into the URL, it *seems* to work fine, but actually behaves as though you never specified any subsite at all - it returns root site information only. On this system, this occurs for ALL web services, even Microsoft's. The reason is that the value of SPContext.Current.Web never points to the subsite you specified. The result is that you cannot use SharePoint subsites with ManifoldCF without causing havoc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does this sound completely unfamiliar to you? If you never encountered it, then we should compare how these instances were set up, unless you have any further ideas.
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 10:12 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Karl (and Dmitry)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For AWS, I had to modify the way the relPath in the addFile function in the FileStream class (in SharePointRepository.java) calculated the modifiedPath
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Essentially, I ensured that the relPath always contains the site as part of the path
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // Null-safe emptiness check (comparing to "" with != tests references, not contents)
>>>>>>>>>>>>>> if (siteName != null && !siteName.isEmpty()) {
>>>>>>>>>>>>>>   int siteInd = relPath.indexOf(siteName);
>>>>>>>>>>>>>>   if (siteInd == -1 || siteInd > 3) {
>>>>>>>>>>>>>>     relPath = siteName + relPath;
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which fixed my pathing issue and the index out of bounds errors.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have also made many other modifications to cope with AD and claims-based auth and compatibility with SharePoint 2013
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dmitry, I have uploaded my modified mcf-sharepoint-connector.jar and MCPermissions WSP if you would like to try them out
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://pngnetworks.com/sharepoint-2010-claims.zip
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just make sure you back up your current ones as this is still very much in development :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, the logging is very verbose.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Will
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 11:41 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Will,
>>>>>>>>>>>>>>> When you folks set up YOUR AWS instance, did it work with MCF out of the box? Or did you need to do something?
>>>>>>>>>>>>>>> And, if so, what did you do?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 9:28 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes that's right, only really interested in the site that you are trying to crawl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 11:25 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Will,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For SharePoint - 80, the output is
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> NTAuthenticationProviders : (STRING) "NTLM"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I assume we're not interested in the Default Web Site; for that, the output is simply "The parameter NTAuthenticationProviders is not set at this node."
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 9:16 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you open IIS manager and click on sites, it is displayed in the ID column (see screenshot attached)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 10:55 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Will,
>>>>>>>>>>>>>>>>>>> Sorry, what is the "sharepoint website *number*" in that invocation?
>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 8:53 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Dmitry
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Just out of interest, what does the following command output on your system
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> cd to C:\inetpub\adminscripts
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> cscript adsutil.vbs get w3svc/<sharepoint website number here>/root/NTAuthenticationProviders
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Will
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 10:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> "This is the second time I'm encountering the issue which leads me to believe it's a quirk of IIS and/or SharePoint."
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It cannot be just a quirk of SharePoint because SharePoint's UI etc could not create or work with subsites if that was true. It may well be a configuration issue with IIS, which is indeed what I suspect. I have pinged all the resources I know of to try and get some insight as to why this is happening.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> "Perhaps this is something that can be worked into the 'fabric' of ManifoldCF as a workaround for a known issue."
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Like I said before, this is a huge amount of work, tantamount to rewriting most of the connector.
>>>>>>>>>>>>>>>>>>>>> If this is what you want to request, that is your option, but there is no way we'd complete any of this work before December/January at the earliest.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> "Just to understand this a bit better, the main breakage here is that the wildcards don't work properly, right?"
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> No, it means that ManifoldCF cannot get at any data of any kind associated with a SharePoint subsite. Accessing root data works fine. If you try to crawl as things are now, you must disable all subsites and just crawl the root site, or you will crawl the same things with longer and longer paths indefinitely.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 8:38 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This is the second time I'm encountering the issue which leads me to believe it's a quirk of IIS and/or SharePoint. Perhaps this is something that can be worked into the 'fabric' of ManifoldCF as a workaround for a known issue. I understand that it may have far-reaching tentacles but I wonder if that's really the only option...
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Just to understand this a bit better, the main breakage here is that the wildcards don't work properly, right? In theory if I have a repo connector config which lists specific library and list paths, things should work?
>>>>>>>>>>>>>>>>>>>>>> It's only when the /* types of wildcards are included, we're in trouble?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 18, 2013 at 8:07 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Someone else was having a similar problem. See
>>>>>>>>>>>>>>>>>>>>>>> http://social.technet.microsoft.com/Forums/sharepoint/en-US/e4b53c63-b89a-4356-a7b0-6ca7bfd22826/getting-sharepoint-subsite-from-custom-webservice.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Apparently it does depend on how you get to the web service, which does argue that it is an IIS issue.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 17, 2013 at 5:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> As discussed privately I had a look at your system. What is happening is that the C# static SPContext.Current.Web is not reflecting the subsite in any URL that contains a subsite. In other words, the URL coming in might be "http://servername/subsite1/_vti_bin/MCPermissions.asmx", but the MCPermissions.asmx plugin will think it is being executed in the root context ("http://servername"). That's pretty broken behavior, so I'm guessing that the problem is that either IIS or SharePoint is somehow misconfigured to do this, and that once the configuration is corrected the web services would begin to work right again.
>>>>>>>>>>>>>>>>>>>>>>>> But I have no idea how this should actually be fixed.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Will Parkinson, one of the subscribers of this list, may find the symptoms meaningful, since he set up an AWS SharePoint instance before. I hope he will respond in a helpful way. Until then, I think we are stuck.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 17, 2013 at 9:49 AM, Dmitry Goldenberg wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> It looks like I'll be able to get access for you to the test system we're using. Would you be interested in working with the system directly? I certainly don't mind doing some testing but I thought we'd speed things up this way. If so, could you email me from a more private account so we can set this up?
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 17, 2013 at 7:38 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Another interesting bit from the log:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/lt/Forms/AllItems.aspx', 'List Template Gallery'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/masterpage/Forms/AllItems.aspx', 'Master Page Gallery'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Shared Documents/Forms/AllItems.aspx', 'Shared Documents'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/SiteAssets/Forms/AllItems.aspx', 'Site Assets'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/SitePages/Forms/AllPages.aspx', 'Site Pages'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/solutions/Forms/AllItems.aspx', 'Solution Gallery'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Style Library/Forms/AllItems.aspx', 'Style Library'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Test Library 1/Forms/AllItems.aspx', 'Test Library 1'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/theme/Forms/AllItems.aspx', 'Theme Gallery'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/wp/Forms/AllItems.aspx', 'Web Part Gallery'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents' exactly matched rule path '/*'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets' exactly matched rule path '/*'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages' exactly matched rule path '/*'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library' exactly matched rule path '/*'
>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>>>>>>>>>>>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> This time it appears that it is the Lists service that is broken and does not recognize the parent site.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I haven't corrected this problem yet since now I am beginning to wonder if *any* of the web services under Amazon work at all for subsites.
>>>>>>>>>>>>>>>>>>>>>>>>>> We may be better off implementing everything we need in the MCPermissions service. I will ponder this as I continue to research the logs.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> It's still valuable to check my getSites() implementation. I'll be doing another round of work tonight on the plugin.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 8:45 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> The augmented plugin can be downloaded from http://people.apache.org/~kwright/MetaCarta.SharePoint.MCPermissionsService.wsp. The revised connector code is also ready, and should be checked out and built from https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-772.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Once you set it all up, you can see if it is doing the right thing by just trying to drill down through subsites in the UI. You should always see a list of subsites that is appropriate for the context you are in; if this does not happen it is not working.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:45 PM, Dmitry Goldenberg wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I can see how preloading the list of subsites may be less optimal..
The advantage of doing it this way is one call and >>>>>>>>>>>>>>>>>>>>>>>>>>>> you've got the structure in memory, which may be OK unless there are sites >>>>>>>>>>>>>>>>>>>>>>>>>>>> with a ton of subsites which may stress out memory. The disadvantage is >>>>>>>>>>>>>>>>>>>>>>>>>>>> having to throw this structure around.. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I'll certainly help test out your changes, >>>>>>>>>>>>>>>>>>>>>>>>>>>> just let me know when they're available. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:19 PM, Karl Wright < >>>>>>>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the code snippet. I'd prefer, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> though, to not preload the entire site structure in memory. Probably it >>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be better to just add another method to the ManifoldCF SharePoint >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2010 plugin. More methods are going to be added anyway to support Claim >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Space Authentication, so I guess this would be just one more. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> We honestly have never seen this problem >>>>>>>>>>>>>>>>>>>>>>>>>>>>> before - so it's not just flakiness, it has something to do with the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> installation, I'm certain. At any rate, I'll get going right away on a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> workaround - if you are willing to test what I produce. I'm also certain >>>>>>>>>>>>>>>>>>>>>>>>>>>>> there is at least one other issue, but hopefully that will become clearer >>>>>>>>>>>>>>>>>>>>>>>>>>>>> once this one is resolved. 
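For reference, the filtering step in the snippet under discussion can be sketched in a self-contained form. This is only an illustration, assuming the flat list of subsite URLs has already been fetched (for instance with the getAllSubWebCollection call mentioned in the quoted message). The class name SubsiteFilter and its methods are hypothetical and are not part of ManifoldCF or SharePoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: picks immediate child sites out of a flat list of subsite URLs. */
final class SubsiteFilter {
    private static final String SLASH = "/";

    private SubsiteFilter() {}

    /** Lower-cases the URL and makes sure it ends with a slash. */
    static String normalize(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return lower.endsWith(SLASH) ? lower : lower + SLASH;
    }

    /** True when candidateUrl sits exactly one path segment below parentUrl. */
    static boolean isImmediateChild(String parentUrl, String candidateUrl) {
        String parent = normalize(parentUrl);
        String child = normalize(candidateUrl);
        if (!child.startsWith(parent)) {
            return false;
        }
        String remainder = child.substring(parent.length());
        // One segment plus its trailing slash: longer than "/" alone,
        // and the only slash is the last character.
        return remainder.length() > 1
            && remainder.indexOf(SLASH) == remainder.length() - 1;
    }

    /** Returns the entries of allSubsiteUrls that are immediate children of parentUrl. */
    static List<String> immediateChildren(String parentUrl, List<String> allSubsiteUrls) {
        List<String> children = new ArrayList<>();
        for (String url : allSubsiteUrls) {
            if (isImmediateChild(parentUrl, url)) {
                children.add(url);
            }
        }
        return children;
    }
}
```

Calling SubsiteFilter.immediateChildren("http://host", urls) would keep "http://host/Abcd" but drop "http://host/Abcd/Sub1". The trade-off raised in the thread still applies: the whole URL list has to be held in memory for this to work.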
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 6:49 PM, Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Goldenberg wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> subsite discovery is effectively disabled >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> except directly under the root site >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes. Come to think of it, I once came across >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this problem while implementing a SharePoint connector. I'm not sure >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> whether it's exactly what's happening with the issue we're discussing but >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> looks like it. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I started off by using multiple >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> getWebCollection calls to get child subsites of sites and trying to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> navigate down that way. The problem was that getWebCollection was always >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> returning the immediate subsites of the root site no matter whether you're >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at the root or below, so I ended up generating infinite loops. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I switched over to using a single >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> getAllSubWebCollection call and caching its results. That call returns the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> full list of all subsites as pairs of Title and Url. I had a POJO similar >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the one below which held the list of sites and contained logic for >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enumerating the child sites, given the URL of a (parent) site. 
From what I recall, getWebCollection works inconsistently, either across SP versions or
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> across installations, but the logic below should work in any case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *** public class SubSiteCollection -- holds a list of CrawledSite POJOs, each of which is a { title, url }.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *** SubSiteCollection has the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> public List<CrawledSite> getImmediateSubSites(String siteUrl) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   List<CrawledSite> subSites = new ArrayList<CrawledSite>();
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   for (CrawledSite site : sites) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     if (isChildOf(siteUrl, site.getUrl().toString())) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>       subSites.add(site);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   return subSites;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> private static boolean isChildOf(String parentUrl, String urlToCheck) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   final String parent = normalizeUrl(parentUrl);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   final String child = normalizeUrl(urlToCheck);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   boolean ret = false;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   if (child.startsWith(parent)) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     String remainder = child.substring(parent.length());
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     // An immediate child leaves one path segment, hence exactly one (trailing) slash.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   return ret;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> private static String normalizeUrl(String >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url) { >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> return ((url.endsWith(SLASH)) ? url : url + >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SLASH).toLowerCase(); >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Have a look at this sequence also: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Subsite list: ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'Abcd' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Subsite list: ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'Defghij' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Subsite list: ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr', >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'Klmnopqr' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Checking whether to include site >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - 
SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> rule path '/*' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Checking whether to include site >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> matched rule path '/*' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Checking whether to include site >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> matched rule path '/*' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <<<<<< >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This is using the GetSites(String parent) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> method with a site name of "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> three sites (!!). 
The parent path is not correct, obviously, but nevertheless this is one way in which paths are getting completely messed up.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It *looks* like the Webs web service is broken in such a way as to ignore
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the URL coming in, except for the base part, which means that subsite
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discovery is effectively disabled except directly under the root site.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This might still be OK if it is not possible to create subsites of subsites
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in this version of SharePoint. Can you confirm that this is or is not possible?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 2:42 PM, Karl Wright wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "This is everything that got generated, from the very beginning"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Well, something isn't right. What I expect to see that I don't right up front are:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - A webs "getWebCollection" invocation for /_vti_bin/webs.asmx
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Two lists "getListCollection" invocations for /_vti_bin/lists.asmx
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Instead the first transactions I see are from already busted URLs - which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> makes no sense, since there would be no way they should have been able to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> get queued yet.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So there are a number of possibilities. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> First, maybe the log isn't getting cleared out, and the session in question >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> therefore starts somewhere in the middle of manifoldcf.log.1. But no: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> C:\logs>grep "POST /_vti_bin/webs" >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manifoldcf.log.1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> grep: input lines truncated - result >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <<<<<< >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Nevertheless there are some interesting >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> points here. First, note the following response, which I've been able to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> determine is against "Test Library 1": >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread '23') - SharePoint: getListItems xml response: '>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> xmlns="">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FileRef="SitePages/Home.aspx"/>' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread '23') - SharePoint: Checking whether to include document >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '/SitePages/Home.aspx' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread '23') - SharePoint: File '/SitePages/Home.aspx' exactly matched rule >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path '/*' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread '23') - SharePoint: Including file '/SitePages/Home.aspx' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> WARN 2013-09-16 13:02:31,590 (Worker 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread '23') - Sharepoint: Unexpected relPath structure; path is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '/SitePages/Home.aspx', but expected length of 26 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <<<<<< >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The FileRef in this case is pointing at >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> what, exactly? Is there a SitePages/Home.aspx in the "Test Library 1" >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> library? Or does it mean to refer back to the root site with this URL >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> construction? And since this is supposedly at the root level, how come the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> combined site + library name comes out to 26?? I get 15, which leaves 11 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> characters unaccounted for. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm still looking at the logs to see if I >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can glean key information. Later, if I could set up a crawl against the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sharepoint instance in question, that would certainly help. I can readily >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> set up an ssh tunnel if that is what is required. But I won't be able to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> do it until I get home tonight. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:58 PM, Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Goldenberg wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This is everything that got generated, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from the very beginning, meaning that I did a fresh build, new database, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> new connection definitions, start. 
The log must have rolled but the .1 log is included.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I were to get you access to the actual test system, would you mind taking
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a look? It may be more efficient than sending logs..
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> These logs are different but have exactly the same problem; they start in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the middle when the crawl is already well underway. I'm wondering if by
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> chance you have more than one agents process running or something? Or maybe
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the log is rolling and stuff is getting lost? What's there is not what I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would expect to see, at all.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I *did* manage to find two transactions that look like they might be helpful,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but because the *results* of those transactions are required by transactions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that take place minutes *before* in the log, I have no confidence that I'm
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> looking at anything meaningful. But I'll get back to you on what I find
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> nonetheless.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you decide to repeat this exercise, try watching the log with "tail -f"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> before starting the job. You should not see any log contents at all until
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the job is started.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Goldenberg wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Attached please find logs which start at >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the beginning. I started from a fresh build (clean db etc.), the logs start >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at server start, then I create the output connection and the repo >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connection, then the job, and then I fire off the job. I aborted the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> execution about a minute into it or so. That's all that's in the logs with: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.connectors=DEBUG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> log4j.logger.httpclient.wire.header=DEBUG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> log4j.logger.org.apache.commons.httpclient=DEBUG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 12:39 PM, Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wright wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Are you sure these are the right logs? >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - They start right in the middle of a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> crawl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - They are already in a broken state >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> when they start, e.g. 
the kinds of things that are being looked up are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> already nonsense paths >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I need to see logs from the BEGINNING >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of a fresh crawl to see how the nonsense paths happen. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:52 AM, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dmitry Goldenberg < >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've generated logs with details as we >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussed. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The job was created afresh, as before: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Path rules: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* file include >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* library include >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* list include >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* site include >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Metadata: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* include true >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The logs are attached. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wright wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "Do you think that this issue is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generic with regard to any Amz instance?" 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I presume so, since you didn't >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> apparently do anything special to set one of these up. Unfortunately, such >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> instances are not part of the free tier, so I am still constrained from >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> setting one up for myself because of household rules here. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "For now, I assume our only >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> workaround is to list the paths of interest manually" >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Depending on what is going wrong, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that may not even work. It looks like several SharePoint web service calls >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> may be affected, and not in a cleanly predictable way, for this to happen. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "is identification and extraction of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> attachments supported in the SP connector?" >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF in general leaves >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> identification and extraction to the search engine. Solr, for instance >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> uses Tika for this, if so configured. You can configure your Solr output >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connection to include or exclude specific mime types or extensions if you >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> want to limit what is attempted. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:09 AM, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dmitry Goldenberg < >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Karl. Do you think that this >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue is generic with regard to any Amz instance? I'm just wondering how >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> easily reproducible this may be.. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For now, I assume our only >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> workaround is to list the paths of interest manually, i.e. add explicit >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> rules for each library and list. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A related subject - is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> identification and extraction of attachments supported in the SP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector? E.g. if I have a Word doc attached to a Task list item, would >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that be extracted? So far, I see that library content gets crawled and I'm >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> getting the list item data but am not sure what happens to the attachments. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl Wright wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the additional >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> information. It does appear like the method that lists subsites is not >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> working as expected under AWS. Nor are some number of other methods which >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> supposedly just list the children of a subsite. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've reopened CONNECTORS-772 to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> work on addressing this issue. Please stay tuned. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dmitry Goldenberg < >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Karl, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Most of the paths that get >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated are listed in the attached log, they match what shows up in the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> diag report. So I'm not sure where they diverge, most of them just don't >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seem right. There are 3 subsites rooted in the main site: Abcd, Defghij, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Klmnopqr. 
It's strange that the connector would try such paths as: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- there are multiple repetitions of the same subsite on the path and to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> begin with, Defghij is not a subsite of Klmnopqr, so why would it try >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this? the /// at the end doesn't seem correct either, unless I'm missing >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> something in how this pathing works. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /Test Library >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks wrong. A >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> docname is mixed into the path, a subsite ends up after a docname?... >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /Shared >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issues plus now somehow the docname got split with a forward slash?.. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There are also a bunch of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> StringIndexOutOfBoundsException's. Perhaps this logic doesn't fit with the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pathing we're seeing on this amz-based installation? >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd expect the logic to just know >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that root contains 3 subsites, and work off that. Each subsite has a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific list of libraries and lists, etc. 
It seems odd that the connector >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> gets into this matching pattern, and tries what looks like thousands of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> variations (I aborted the execution). >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl Wright wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To clarify, the way you would >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> need to analyze this is to run a crawl with the wildcards as you have >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> selected, abort if necessary after a while, and then use the Document >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Status report to list the document identifiers that had been generated. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Find a document identifier that you believe represents a path that is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> illegal, and figure out what SOAP getChild call caused the problem by >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> returning incorrect data. In other words, find the point in the path where >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the path diverges from what exists into what doesn't exist, and go back in >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the ManifoldCF logs to find the particular SOAP request that led to the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue. 
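The "find the point where the path diverges" step described above can be sketched as a small helper. This is a hedged illustration only: the class PathDivergence and the set of known paths are hypothetical, and in practice the list of paths that really exist would have to come from inspecting the SharePoint instance itself.

```java
import java.util.Set;

/** Hypothetical helper for locating where a generated path stops matching reality. */
final class PathDivergence {
    private PathDivergence() {}

    /**
     * Walks the path segment by segment and returns the longest leading portion
     * for which every prefix is present in knownPaths ("" if even the first
     * segment is unknown). The segment that follows is where the path diverges.
     */
    static String longestValidPrefix(String path, Set<String> knownPaths) {
        String valid = "";
        StringBuilder current = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the leading slash and any doubled slashes
            }
            current.append('/').append(segment);
            if (!knownPaths.contains(current.toString())) {
                break;
            }
            valid = current.toString();
        }
        return valid;
    }
}
```

For an identifier such as "/Klmnopqr/Defghij/Defghij/Announcements///", and a known set in which "/Klmnopqr" exists but "/Klmnopqr/Defghij" does not, the helper returns "/Klmnopqr". The request that listed the children of that site is then the one to look for in the ManifoldCF log.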
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd expect from your description >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that the problem lies with getting child sites given a site path, but >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that's just a guess at this point. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl Wright wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't understand what you mean >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> by "I've tried the set of wildcards as below and I seem to be running into >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a lot of cycles, where various subsite folders are appended to each other >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and an extraction of data at all of those locations is attempted". If you >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are seeing cycles it means that document discovery is still failing in some >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way. For each folder/library/site/subsite, only the children of that >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> folder/library/site/subsite should be appended to the path - ever. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you can give a specific >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> example, preferably including the soap back-and-forth, that would be very >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> helpful. 
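One way to see how such cycles could arise is a toy simulation. It assumes, as suspected elsewhere in this thread but not confirmed here, that the child-listing call ignores the parent it is given and always answers for the root site. Everything in it is hypothetical: BrokenListingDemo and listChildrenIgnoringParent are stand-ins, not ManifoldCF or SharePoint code, and only the three subsite names are taken from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of discovery against a child listing that ignores its parent argument. */
final class BrokenListingDemo {
    static final List<String> ROOT_CHILDREN = Arrays.asList("Abcd", "Defghij", "Klmnopqr");

    private BrokenListingDemo() {}

    /** Faulty stand-in: whatever parentPath is passed, the root's children come back. */
    static List<String> listChildrenIgnoringParent(String parentPath) {
        return ROOT_CHILDREN;
    }

    /**
     * Level-by-level discovery starting at the root (""). It is capped at maxDepth
     * because, with the faulty listing, every discovered site appears to have three
     * more children and the walk would otherwise never finish.
     */
    static List<String> discover(int maxDepth) {
        List<String> discovered = new ArrayList<>();
        List<String> frontier = new ArrayList<>();
        frontier.add("");
        for (int depth = 0; depth < maxDepth; depth++) {
            List<String> next = new ArrayList<>();
            for (String parent : frontier) {
                for (String child : listChildrenIgnoringParent(parent)) {
                    String path = parent + "/" + child;
                    discovered.add(path);
                    next.add(path);
                }
            }
            frontier = next;
        }
        return discovered;
    }
}
```

With three root subsites the number of generated paths triples at each level, and identifiers such as "/Klmnopqr/Defghij" show up even though Defghij is not a subsite of Klmnopqr. A correct listing would return only the real children of the given parent, so the walk would end at the leaves.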
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dmitry Goldenberg < >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Karl, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Quick question. Is there an >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> easy way to configure an SP repo connection for crawling of all content, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from the root site all the way down? >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've tried the set of wildcards >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as below and I seem to be running into a lot of cycles, where various >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> subsite folders are appended to each other and an extraction of data at all >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of those locations is attempted. Ideally I'd like to avoid having to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> construct an exact set of paths because the set may change, especially with >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> new content being added. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Path rules: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* file include >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* library include >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* list include >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* site include >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Metadata: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* include true >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd also like to pull down any >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> files attached to list items. I'm hoping that some type of "/* file >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> include" should do it, once I figure out how to safely include all content. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> 
Tried a crawl here, with the following rules:

site: "/"
library: "/*"
file: "/*"

Crawled 10 documents properly and completed, indexing 4 actual files.

I'm going to try lists, and if that works, merge the contents of CONNECTORS-772 branch into trunk.

Karl



On Wed, Sep 18, 2013 at 2:56 PM, Karl Wright <daddywri@gmail.com> wrote:
I forgot to mention: I removed the "4.0 AWS" selection. Select just plain 4.0 instead.

Karl



On Wed, Sep 18, 2013 at 2:06 PM, Karl Wright <daddywri@gmail.com> wrote:
Thanks.

I committed a better fix. You will need a clean job again though if you want to try it.

Karl


On Wed, Sep 18, 2013 at 1:30 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

Attaching the full log.

- Dmitry


On Wed, Sep 18, 2013 at 1:15 PM, Karl Wright <daddywri@gmail.com> wrote:
Ok - is there a "Checking whether to include library" message in the log? If so, can you send that to me?

Karl


On Wed, Sep 18, 2013 at 1:02 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Karl,

I'm definitely seeing this issue, after a full 'rejig' of the system: svn up, ant clean (actually blew away dist/example), ant build, re-created the connectors and job. Still seeing those string index out of bounds exceptions.

- Dmitry


On Wed, Sep 18, 2013 at 12:15 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

I think this is the same bug I fixed earlier today. I think you just have a job around from before the code change that fixed it. If you can create a new job and run that, see if you get the same issue.

I'll be able to explore this more thoroughly when I get home tonight; from here I cannot see your instance due to firewall.

Karl


On Wed, Sep 18, 2013 at 12:01 PM, Karl Wright <daddywri@gmail.com> wrote:
Not a regression; a bug I introduced. Let me look at it - should be fixable shortly.
Karl


On Wed, Sep 18, 2013 at 11:48 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Karl,

I've just re-tested using the latest. I wonder if there's a regression issue. Just crawling /Shared Documents of the root site, I'm running into what seems like an indefinite loop of retrying to crawl that directory, with the following error showing up time after time:


DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: Getting version of '//Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: Checking whether to include document '/Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: File '/Shared Documents/test-word-doc-1.docx' exactly matched rule path '/Shared Documents/*'
DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: Including file '/Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:24,959 (Worker thread '0') - SharePoint: Finding metadata to include for document/item '/Shared Documents/test-word-doc-1.docx'.
FATAL 2013-09-18 11:42:25,004 (Worker thread '0') - Error tossed: String index out of range: -1
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1911)
    at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.getDocumentVersions(SharePointRepository.java:926)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)
DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: Getting version of '//Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: Checking whether to include document '/Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: File '/Shared Documents/test-word-doc-1.docx' exactly matched rule path '/Shared Documents/*'
DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: Including file '/Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,835 (Worker thread '2') - SharePoint: Finding metadata to include for document/item '/Shared Documents/test-word-doc-1.docx'.
FATAL 2013-09-18 11:42:26,840 (Worker thread '2') - Error tossed: String index out of range: -1
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1911)
    at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.getDocumentVersions(SharePointRepository.java:926)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)
DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: Getting version of '//Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: Checking whether to include document '/Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: File '/Shared Documents/test-word-doc-1.docx' exactly matched rule path '/Shared Documents/*'
DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: Including file '/Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,860 (Worker thread '1') - SharePoint: Finding metadata to include for document/item '/Shared Documents/test-word-doc-1.docx'.
FATAL 2013-09-18 11:42:26,865 (Worker thread '1') - Error tossed: String index out of range: -1
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1911)
    at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.getDocumentVersions(SharePointRepository.java:926)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)
DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: Getting version of '//Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: Checking whether to include document '/Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: File '/Shared Documents/test-word-doc-1.docx' exactly matched rule path '/Shared Documents/*'
DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: Including file '/Shared Documents/test-word-doc-1.docx'
DEBUG 2013-09-18 11:42:26,885 (Worker thread '3') - SharePoint: Finding metadata to include for document/item '/Shared Documents/test-word-doc-1.docx'.
FATAL 2013-09-18 11:42:26,895 (Worker thread '3') - Error tossed: String index out of range: -1
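
The FATAL entries in this log report "String index out of range: -1" thrown from String.substring. The thread does not show line 926 of SharePointRepository.java, so the following is only a minimal sketch of how this kind of error commonly arises, under the assumption that an indexOf result of -1 reaches substring unchecked. The class name SubstringGuardSketch and the helper prefixBefore are hypothetical and are not part of ManifoldCF.

```java
final class SubstringGuardSketch {
    // Hypothetical helper: returns the part of path that precedes marker,
    // or null when marker does not occur, instead of throwing.
    static String prefixBefore(String path, String marker) {
        int idx = path.indexOf(marker);
        if (idx < 0) {
            return null;
        }
        return path.substring(0, idx);
    }

    public static void main(String[] args) {
        String path = "/Shared Documents/test-word-doc-1.docx";
        // Unguarded form: indexOf returns -1 here, so substring(0, -1) throws.
        try {
            path.substring(0, path.indexOf("//"));
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded call threw " + e.getClass().getSimpleName());
        }
        // Guarded form: the missing marker is reported as null.
        System.out.println(prefixBefore(path, "/test-word"));
        System.out.println(prefixBefore(path, "//"));
    }
}
```

Checking the index before calling substring turns a worker-thread crash into a condition the caller can handle or log.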



On Wed, Sep 18, 2013 at 11:27 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

It may be worth reviewing with that engineer what steps he took when he installed the instance. If he used the standard installer, IIRC there are a number of ways you can mess this up - the primary way being if you try to install IIS afterwards and then just try to patch things up. The canned install usually does best if IIS is installed first.

At any rate, I think that you have a probable case of "operator error" here...

Karl


I can think of a few possibilities.


On Wed, Sep 18, 2013 at 11:16 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
SharePoint was not installed by a domain user (the Windows instance is not on a domain).

This is not a canned AWS SharePoint installation; an engineer on the team installed it, using the standard installer program, I believe.


On Wed, Sep 18, 2013 at 10:34 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
Dmitry, do you know if Sharepoint was installed by a domain user? I have heard of issues with Sharepoint if not installed using a domain user (e.g. DOMAIN\someuser)


On Thu, Sep 19, 2013 at 12:31 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
No, I didn't have that issue. The issue I had was the // and /// references being added in the wrong places in the page URLs.

I was getting things like

/Site Name/Lib///rary/test.aspx

My first set up was an out of the box set up, the main site was on port 80, using classic authentication. With the path modification in the mcf-sharepoint-connector.jar, it worked very well.

I set up Active Directory on that same server to authenticate via NTLM.

The second server had the site on https on port 443, had claims based authentication using ADFS and Kerberos. I had to modify the mcf-sharepoint-connector.jar and MCPermissions.wsp to get this to work around the lack of SIDs returned from the permissions webservice.

In this case, Active Directory and ADFS were set up on separate AWS servers.




On Thu, Sep 19, 2013 at 12:23 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Will,

The path stuff we're already dealing with - see the CONNECTORS-772 branch. But what we are having trouble with is something much more fundamental. On Dmitry's AWS instance, when you talk to the web services for a root site, it works fine. But as soon as you add a subsite path into the URL, it *seems* to work fine, but actually behaves as though you never specified any subsite at all - it returns root site information only. On this system, this occurs for ALL web services, even Microsoft's. The reason is that the value of SPContext.Current.Web never points to the subsite you specified. The result is that you cannot use SharePoint subsites with ManifoldCF without causing havoc.

Does this sound completely unfamiliar to you? If you never encountered it, then we should compare how these instances were set up, unless you have any further ideas.

Thanks,
Karl



On Wed, Sep 18, 2013 at 10:12 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
Hey Karl (and Dmitry)

For AWS, I had to modify the way the relPath in the addFile function in the FileStream class (in SharepointRepository.java) calculated the modifiedPath.

Essentially, I ensured that the relPath always contains the site as part of the path

    if (siteName != "") {
        int siteInd = relPath.indexOf(siteName);
        if (siteInd == -1 || siteInd > 3) {
            relPath = siteName + relPath;
        }
    }

Which fixed my pathing issue and the index out of bounds errors.
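
Will's snippet can be read as a small standalone function. The sketch below is not the connector's actual code; the class name RelPathSketch and the method ensureSitePrefix are hypothetical. It also assumes the emptiness check was meant to compare string contents: in Java, `!=` on strings compares object references, so the sketch uses isEmpty() instead.

```java
final class RelPathSketch {
    // Prepend siteName unless it already occurs within the first few
    // characters of relPath (same threshold of 3 as in the quoted snippet).
    static String ensureSitePrefix(String siteName, String relPath) {
        if (siteName == null || siteName.isEmpty()) {
            return relPath;
        }
        int siteInd = relPath.indexOf(siteName);
        if (siteInd == -1 || siteInd > 3) {
            return siteName + relPath;
        }
        return relPath;
    }

    public static void main(String[] args) {
        // Site missing from the path: it gets prepended.
        System.out.println(ensureSitePrefix("/Site Name", "/Lib/test.aspx"));
        // Site already at the front: path is returned unchanged.
        System.out.println(ensureSitePrefix("/Site Name", "/Site Name/Lib/test.aspx"));
        // Empty site name: path is returned unchanged.
        System.out.println(ensureSitePrefix("", "/Lib/test.aspx"));
    }
}
```

Returning a new string rather than reassigning relPath keeps the helper free of side effects, which makes the rule easy to test in isolation.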

I have also made many other modifications to cope with AD and claims based auth and compatibility with Sharepoint 2013

Dmitry, I have uploaded my modified mcf-sharepoint-connector.jar and MCPermissions WSP if you would like to try them out

http://pngnetworks.com/sharepoint-2010-claims.zip

Just make sure you back up your current ones as this is still very much in development :)

Also, the logging is very verbose.

Cheers,

Will


On Wed, Sep 18, 2013 at 11:41 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Will,
When you folks set up YOUR AWS instance, did it work with MCF out of the box? Or did you need to do something? And, if so, what did you do?

Karl



On Wed, Sep 18, 2013 at 9:28 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
Yes that's right, only really interested in the site that you are trying to crawl

On Wed, Sep 18, 2013 at 11:25 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Will,

For SharePoint - 80, the output is

NTAuthenticationProviders       : (STRING) "NTLM"

I assume we're not interested in the Default Web Site; for that, the output is simply "The parameter NTAuthenticationProviders is not set at this node."

- Dmitry


On Wed, Sep 18, 2013 at 9:16 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
If you open IIS manager and click on sites, it is displayed in the ID column (see screenshot attached)


On Wed, Sep 18, 2013 at 10:55 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Will,
Sorry, what is the "sharepoint website number" in that invocation?
- Dmitry


On Wed, Sep 18, 2013 at 8:53 AM, Will Parkinson <parkinson.will@gmail.com> wrote:
Hi Dmitry

Just out of interest, what does the following command output on your system
cd to C:\inetpub\adminscripts

cscript adsutil.vbs get w3svc/<put your sharepoint website number here>/root/NTAuthenticationProviders

Cheers,

Will


On Wed, Sep 18, 2013 at 10:44 PM, Karl Wright <daddywri@gmail.com> wrote:
"This is the second time I'm encountering the issue which leads me to believe it's a quirk of IIS and/or SharePoint."

It cannot be just a quirk of SharePoint because SharePoint's UI etc could not create or work with subsites if that was true. It may well be a configuration issue with IIS, which is indeed what I suspect. I have pinged all the resources I know of to try and get some insight as to why this is happening.


"Perhaps this is something that can be worked into the 'fabric' of ManifoldCF as a workaround for a known issue."

Like I said before, this is a huge amount of work, tantamount to rewriting most of the connector. If this is what you want to request, that is your option, but there is no way we'd complete any of this work before December/January at the earliest.


"Just to understand this a bit better, the main breakage here is that the wildcards don't work properly, right? "

No, it means that ManifoldCF cannot get at any data of any kind associated with a SharePoint subsite. Accessing root data works fine. If you try to crawl as things are now, you must disable all subsites and just crawl the root site, or you will crawl the same things with longer and longer paths indefinitely.

Karl
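
Karl's point about "longer and longer paths" can be illustrated with a small simulation. This is only a sketch: brokenGetSubsites is a hypothetical stand-in for a web service that ignores the subsite in the request and always answers for the root site, and the three site names are the anonymized ones that appear in the logs elsewhere in this thread. The depth cap exists only so the demonstration terminates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class RunawayDiscoverySketch {
    // Simulated broken service: the parent path is ignored and the root
    // site's children are returned for every request.
    static List<String> brokenGetSubsites(String parentPath) {
        return Arrays.asList("Abcd", "Defghij", "Klmnopqr");
    }

    // Breadth-first subsite discovery, capped at maxDepth levels.
    static List<String> discover(int maxDepth) {
        List<String> found = new ArrayList<>();
        List<String> frontier = Collections.singletonList("");
        for (int depth = 0; depth < maxDepth; depth++) {
            List<String> next = new ArrayList<>();
            for (String parent : frontier) {
                for (String child : brokenGetSubsites(parent)) {
                    next.add(parent + "/" + child);
                }
            }
            found.addAll(next);
            frontier = next;
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> paths = discover(3);
        System.out.println(paths.size() + " paths discovered");
        System.out.println("deepest example: " + paths.get(paths.size() - 1));
    }
}
```

With three root subsites the number of bogus paths triples at every level, so without the cap discovery never finishes; every generated path is a site that does not exist but still "answers" with root data.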





On Wed, Sep 18, 2013 at 8:38 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

This is the second time I'm encountering the issue which leads me to believe it's a quirk of IIS and/or SharePoint. Perhaps this is something that can be worked into the 'fabric' of ManifoldCF as a workaround for a known issue. I understand that it may have far-reaching tentacles but I wonder if that's really the only option...

Just to understand this a bit better, the main breakage here is that the wildcards don't work properly, right? In theory if I have a repo connector config which lists specific library and list paths, things should work? It's only when the /* types of wildcards are included, we're in trouble?

- Dmitry


On Wed, Sep 18, 2013 at 8:07 AM, Karl Wright <daddywri@gmail.com> wrote:
Apparently it does depend on how you get to the web service, which does argue that it is an IIS issue.

Karl



On Tue, Sep 17, 2013 at 5:44 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

As discussed privately I had a look at your system. What is happening is that the C# static SPContext.Current.Web is not reflecting the subsite in any URL that contains a subsite. In other words, the URL coming in might be "http://servername/subsite1/_vti_bin/MCPermissions.asmx", but the MCPermissions.asmx plugin will think it is being executed in the root context ("http://servername"). That's pretty broken behavior, so I'm guessing that either IIS or SharePoint is somehow misconfigured to do this, and that correcting the configuration would make the web services work right again. But I have no idea how this should actually be fixed.

Will Parkinson, one of the subscribers of this list, may find the symptoms meaningful, since he set up an AWS SharePoint instance before. I hope he will respond in a helpful way. Until then, I think we are stuck.

Thanks,
Karl



On Tue, Sep 17, 2013 at 9:49 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Karl,

It looks like I'll be able to get access for you to the test system we're using. Would you be interested in working with the system directly? I certainly don't mind doing some testing but I thought we'd speed things up this way. If so, could you email me from a more private account so we can set this up?

Thanks,
- Dmitry



On Tue, Sep 17, 2013 at 7:38 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Another interesting bit from the log:

>>>>>>
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/lt/Forms/AllItems.aspx', 'List Template Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/masterpage/Forms/AllItems.aspx', 'Master Page Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Shared Documents/Forms/AllItems.aspx', 'Shared Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/SiteAssets/Forms/AllItems.aspx', 'Site Assets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/SitePages/Forms/AllPages.aspx', 'Site Pages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/solutions/Forms/AllItems.aspx', 'Solution Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Style Library/Forms/AllItems.aspx', 'Style Library'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/Test Library 1/Forms/AllItems.aspx', 'Test Library 1'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/theme/Forms/AllItems.aspx', 'Theme Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library list: '/_catalogs/wp/Forms/AllItems.aspx', 'Web Part Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
<<<<<<

This time it appears that it is the Lists service that is broken and does not recognize the parent site.

I haven't corrected this problem yet since now I am beginning to wonder if *any* of the web services under Amazon work at all for subsites. We may be better off implementing everything we need in the MCPermissions service. I will ponder this as I continue to research the logs.

It's still valuable to check my getSites() implementation. I'll be doing another round of work tonight on the plugin.

Karl


On Mon, Sep 16, 2013 at 8:45 PM, Karl Wright <daddywri@gmail.com> wrote:
The augmented plugin can be downloaded from http://people.apache.org/~kwright/MetaCarta.SharePoint.MCPermissionsService.wsp . The revised connector code is also ready, and should be checked out and built from https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-772 .

Once you set it all up, you can see if it is doing the right thing by just trying to drill down through subsites in the UI. You should always see a list of subsites that is appropriate for the context you are in; if this does not happen it is not working.

Thanks,
Karl

On Mon, Sep 16, 2013 at 7:45 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

I can see how preloading the list of subsites may be less optimal.. The advantage of doing it this way is one call and you've got the structure in memory, which may be OK unless there are sites with a ton of subsites which may stress out memory. The disadvantage is having to throw this structure around..

Yes, I'll certainly help test out your changes, just let me know when they're available.

Thanks,
- Dmitry


On Mon, Sep 16, 2013 at 7:19 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Thanks for the code snippet. I'd prefer, though, to not preload the entire site structure in memory. Probably it would be better to just add another method to the ManifoldCF SharePoint 2010 plugin. More methods are going to be added anyway to support Claim Space Authentication, so I guess this would be just one more.

We honestly have never seen this problem before - so it's not just flakiness, it has something to do with the installation, I'm certain. At any rate, I'll get going right away on a workaround - if you are willing to test what I produce. I'm also certain there is at least one other issue, but hopefully that will become clearer once this one is resolved.

Thanks,
Karl




On Mon, Sep 16, 2013 at 6:49 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

>> subsite discovery is effectively disabled except directly under the root site

Yes. Come to think of it, I once came across this problem while implementing a SharePoint connector. I'm not sure whether it's exactly what's happening with the issue we're discussing but looks like it.

I started off by using multiple getWebCollection calls to get child subsites of sites and trying to navigate down that way. The problem was that getWebCollection was always returning the immediate subsites of the root site no matter whether you're at the root or below, so I ended up generating infinite loops.

I switched over to using a single getAllSubWebCollection call and caching its results. That call returns the full list of all subsites as pairs of Title and Url. I had a POJO similar to the one below which held the list of sites and contained logic for enumerating the child sites, given the URL of a (parent) site. From what I recall, getWebCollection works inconsistently, either across SP versions or across installations, but the logic below should work in any case.

*** public class SubSiteCollection -- holds a list of CrawledSite pojo's each of which is a { title, url }.

*** SubSiteCollection has the following:

 public List<CrawledSite> getImmediateSubSites(String siteUrl) {
   List<CrawledSite> subSites = new ArrayList<CrawledSite>();
   for (CrawledSite site : sites) {
     if (isChildOf(siteUrl, site.getUrl().toString())) {
       subSites.add(site);
     }
   }
   return subSites;
 }

 private static boolean isChildOf(String parentUrl, String urlToCheck) {
   final String parent = normalizeUrl(parentUrl);
   final String child = normalizeUrl(urlToCheck);
   boolean ret = false;
   if (child.startsWith(parent)) {
     String remainder = child.substring(parent.length());
     ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
   }
   return ret;
 }

 private static String normalizeUrl(String url) {
   return ((url.endsWith(SLASH)) ? url : url + SLASH).toLowerCase();
 }

- Dmitry

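
Dmitry's helper relies on pieces his message does not show: the CrawledSite class, the SLASH constant, the sites field, and a StringUtils.countOccurrencesOf utility. The following self-contained sketch fills those in with assumptions so the child-site test can be run on its own: CrawledSite is reduced to two string fields, SLASH is taken to be "/", and the counting utility is replaced by a plain loop. The class name SubSiteCollectionSketch and the host name in the example are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SubSiteCollectionSketch {
    static final String SLASH = "/";

    // Minimal stand-in for the CrawledSite pojo: a { title, url } pair.
    static final class CrawledSite {
        final String title;
        final String url;

        CrawledSite(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    private final List<CrawledSite> sites;

    SubSiteCollectionSketch(List<CrawledSite> sites) {
        this.sites = sites;
    }

    // Returns only the sites exactly one path segment below siteUrl.
    List<CrawledSite> getImmediateSubSites(String siteUrl) {
        List<CrawledSite> subSites = new ArrayList<>();
        for (CrawledSite site : sites) {
            if (isChildOf(siteUrl, site.url)) {
                subSites.add(site);
            }
        }
        return subSites;
    }

    static boolean isChildOf(String parentUrl, String urlToCheck) {
        String parent = normalizeUrl(parentUrl);
        String child = normalizeUrl(urlToCheck);
        if (!child.startsWith(parent)) {
            return false;
        }
        // After normalization a direct child leaves "name/": exactly one slash.
        String remainder = child.substring(parent.length());
        return countSlashes(remainder) == 1;
    }

    // Plain-loop replacement for the counting utility used in the message.
    static int countSlashes(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    static String normalizeUrl(String url) {
        return (url.endsWith(SLASH) ? url : url + SLASH).toLowerCase();
    }

    public static void main(String[] args) {
        SubSiteCollectionSketch all = new SubSiteCollectionSketch(Arrays.asList(
            new CrawledSite("Abcd", "http://host/Abcd"),
            new CrawledSite("Nested", "http://host/Abcd/Nested"),
            new CrawledSite("Defghij", "http://host/Defghij")));
        for (CrawledSite s : all.getImmediateSubSites("http://host")) {
            System.out.println(s.title);
        }
    }
}
```

Because the parent/child relationship is derived from the cached URL list, the result does not depend on the web service honoring the subsite in the request URL.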


On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Have a look at this sequence also:

>>>>>>
DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd', 'Abcd'
DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij', 'Defghij'
DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr', 'Klmnopqr'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly matched rule path '/*'
DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
<<<<<<

This is using the GetSites(String parent) method with a site name of "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back three sites (!!). The parent path is not correct, obviously, but nevertheless this is one way in which paths are getting completely messed up. It *looks* like the Webs web service is broken in such a way as to ignore the URL coming in, except for the base part, which means that subsite discovery is effectively disabled except directly under the root site.

This might still be OK if it is not possible to create subsites of subsites in this version of SharePoint. Can you confirm that this is or is not possible?

Karl



On Mon, Sep 16, 2013 at 2:42 PM, Karl Wr= ight <daddywri@gmail.com> wrote:
"This is everything that got generated, fro= m the very beginning"

Well, something isn't right.=A0= What I expect to see that I don't right up front are:

- A webs "getWebCollection" invocation for /_vti_bin/w= ebs.asmx
- Two lists "getListCollection" invocations for /_vti_= bin/lists.asmx

Instead the first transactions = I see are from already busted URLs - which make no sense since there would = be no way they should have been able to get queued yet.

So there are a number of possibilities.=A0 First, maybe the = log isn't getting cleared out, and the session in question therefore st= arts somewhere in the middle of manifoldcf.log.1.=A0 But no:

>>= ;>>>>
C:\logs>grep "POST /_vti_bin/webs" manifoldcf.log.1
grep: i= nput lines truncated - result questionable
<<<<&l= t;<

Nevertheless there are some interesting points her= e.=A0 First, note the following response, which I've been able to deter= mine is against "Test Library 1":

>>>>>>
DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: getListItems xml response: '<GetListItems xmlns="http://schemas.microsoft.com/sharepoint/soap/directory/"><GetListItemsResponse xmlns=""><GetListItemsResult FileRef="SitePages/Home.aspx"/></GetListItemsResponse></GetListItems>'
DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: Checking whether to include document '/SitePages/Home.aspx'
DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: File '/SitePages/Home.aspx' exactly matched rule path '/*'
DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: Including file '/SitePages/Home.aspx'
 WARN 2013-09-16 13:02:31,590 (Worker thread '23') - Sharepoint: Unexpected relPath structure; path is '/SitePages/Home.aspx', but expected <list/library> length of 26
<<<<<<

The FileRef in this case is pointing at what, exactly?  Is there a SitePages/Home.aspx in the "Test Library 1" library?  Or does it mean to refer back to the root site with this URL construction?  And since this is supposedly at the root level, how come the combined site + library name comes out to 26??  I get 15, which leaves 11 characters unaccounted for.
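
The arithmetic can be checked with a minimal sketch.  It assumes the root site contributes an empty path and the library is addressed as "/" plus its name; how the connector actually computes the expected length is not shown in this thread, so the variable names and the concatenation are illustrative only.

```python
# Assumption: the root site contributes an empty path, and the library is
# addressed as "/" + its display name. The connector's real computation
# is not shown in this thread.
site_path = ""
library_name = "Test Library 1"
combined = site_path + "/" + library_name

expected_length = 26  # value reported in the WARN line of the log excerpt
unaccounted = expected_length - len(combined)

print(len(combined), unaccounted)
```

Under those assumptions the combined name is 15 characters long, so 11 of the 26 characters reported in the warning have no obvious source.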

I'm still looking at the logs to see if I can glean key information.  Later, if I could set up a crawl against the sharepoint instance in question, that would certainly help.  I can readily set up an ssh tunnel if that is what is required.  But I won't be able to do it until I get home tonight.

Karl



On Mon, Sep 16, 2013 at 1:58 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

This is everything that got generated, from the very beginning, meaning that I did a fresh build, new database, new connection definitions, start. The log must have rolled but the .1 log is included.

If I were to get you access to the actual test system, would you mind taking a look? It may be more efficient than sending logs..

- Dmitry


On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <daddywri@gmail.com> wrote:
These logs are different but have exactly the same problem; they start in the middle when the crawl is already well underway.  I'm wondering if by chance you have more than one agents process running or something?  Or maybe the log is rolling and stuff is getting lost?  What's there is not what I would expect to see, at all.

I *did* manage to find two transactions that look like they might be helpful, but because the *results* of those transactions are required by transactions that take place minutes *before* in the log, I have no confidence that I'm looking at anything meaningful.  But I'll get back to you on what I find nonetheless.

If you decide to repeat this exercise, try watching the log with "tail -f" before starting the job.  You should not see any log contents at all until the job is started.

Karl


On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

Attached please find logs which start at the beginning. I started from a fresh build (clean db etc.), the logs start at server start, then I create the output connection and the repo connection, then the job, and then I fire off the job. I aborted the execution about a minute into it or so.  That's all that's in the logs with:

org.apache.manifoldcf.connectors=DEBUG

log4j.logger.httpclient.wire.header=DEBUG
log4j.logger.org.apache.commons.httpclient=DEBUG

- Dmitry



On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Are you sure these are the right logs?
- They start right in the middle of a crawl
- They are already in a broken state when they start, e.g. the kinds of things that are being looked up are already nonsense paths

I need to see logs from the BEGINNING of a fresh crawl to see how the nonsense paths happen.

Thanks,
Karl




On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Karl,

I've generated logs with details as we discussed.

The job was created afresh, as before:
Path rules:
/* file include
/* library include
/* list include
/* site include
Metadata:
/* include true
The logs are attached.
- Dmitry

On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:
"Do you think that this issue is generic with regard to any Amz instance?"

I presume so, since you didn't apparently do anything special to set one of these up.  Unfortunately, such instances are not part of the free tier, so I am still constrained from setting one up for myself because of household rules here.

"For now, I assume our only workaround is to list the paths of interest manually"

Depending on what is going wrong, that may not even work.  It looks like several SharePoint web service calls may be affected, and not in a cleanly predictable way, for this to happen.

"is identification and extraction of attachments supported in the SP connector?"

ManifoldCF in general leaves identification and extraction to the search engine.  Solr, for instance, uses Tika for this, if so configured.  You can configure your Solr output connection to include or exclude specific mime types or extensions if you want to limit what is attempted.

Karl



On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Thanks, Karl. Do you think that this issue is generic with regard to any Amz instance? I'm just wondering how easily reproducible this may be..

For now, I assume our only workaround is to list the paths of interest manually, i.e. add explicit rules for each library and list.

A related subject - is identification and extraction of attachments supported in the SP connector?  E.g. if I have a Word doc attached to a Task list item, would that be extracted?  So far, I see that library content gets crawled and I'm getting the list item data but am not sure what happens to the attachments.


On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

Thanks for the additional information.  It does appear like the method that lists subsites is not working as expected under AWS.  Nor are some number of other methods which supposedly just list the children of a subsite.

I've reopened CONNECTORS-772 to work on addressing this issue.  Please stay tuned.

Karl



On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Karl,

Most of the paths that get generated are listed in the attached log; they match what shows up in the diag report. So I'm not sure where they diverge, most of them just don't seem right.  There are 3 subsites rooted in the main site: Abcd, Defghij, Klmnopqr.  It's strange that the connector would try such paths as:

/Klmnopqr/Defghij/Defghij/Announcements/// -- there are multiple repetitions of the same subsite on the path and to begin with, Defghij is not a subsite of Klmnopqr, so why would it try this? The /// at the end doesn't seem correct either, unless I'm missing something in how this pathing works.

/Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks wrong. A docname is mixed into the path, a subsite ends up after a docname?...

/Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types of issues plus now somehow the docname got split with a forward slash?..

There are also a bunch of StringIndexOutOfBoundsException's.  Perhaps this logic doesn't fit with the pathing we're seeing on this amz-based installation?

I'd expect the logic to just know that root contains 3 subsites, and work off that. Each subsite has a specific list of libraries and lists, etc. It seems odd that the connector gets into this matching pattern, and tries what looks like thousands of variations (I aborted the execution).
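
Two of the symptoms described above (a segment repeated back to back, and runs of empty segments such as "///") are easy to flag mechanically when scanning a long list of generated paths. The sketch below is only a rough heuristic written for illustration; the function name and the checks are assumptions and are not part of the connector.

```python
def suspicious_reasons(path):
    """Return a list of reasons why a generated path looks malformed.

    Hypothetical helper for triaging document identifiers; it only checks
    two patterns seen in this thread and is not an exhaustive validator.
    """
    reasons = []
    segments = [s for s in path.split("/") if s]
    # The same segment twice in a row, e.g. ".../Defghij/Defghij/..."
    if any(a == b for a, b in zip(segments, segments[1:])):
        reasons.append("repeated segment")
    # Two or more consecutive slashes, e.g. a trailing "///"
    if "//" in path:
        reasons.append("empty segment")
    return reasons


print(suspicious_reasons("/Klmnopqr/Defghij/Defghij/Announcements///"))
print(suspicious_reasons("/Abcd/Announcements"))
```

A check like this would not catch the docname-split cases, which need knowledge of what actually exists on the site.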

- Dmitry



On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

To clarify, the way you would need to analyze this is to run a crawl with the wildcards as you have selected, abort if necessary after a while, and then use the Document Status report to list the document identifiers that had been generated.  Find a document identifier that you believe represents a path that is illegal, and figure out what SOAP getChild call caused the problem by returning incorrect data.  In other words, find the point in the path where the path diverges from what exists into what doesn't exist, and go back in the ManifoldCF logs to find the particular SOAP request that led to the issue.

I'd expect from your description that the problem lies with getting child sites given a site path, but that's just a guess at this point.
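
The "find the point where the path diverges" step can be sketched roughly as follows. This is a minimal illustration, assuming the real site structure has been written down by hand as a nested dictionary; the function name and the sample tree (including the placement of "Announcements") are hypothetical and not taken from the connector or from the actual site.

```python
def first_divergence(path, tree):
    """Return (index, segment) of the first path segment that does not
    exist in the known tree, or None if the whole path exists."""
    node = tree
    segments = [s for s in path.split("/") if s]
    for i, seg in enumerate(segments):
        if seg not in node:
            return i, seg
        node = node[seg]
    return None


# Hypothetical structure: three subsites under the root site.
known = {
    "Abcd": {},
    "Defghij": {"Announcements": {}},
    "Klmnopqr": {},
}

print(first_divergence("/Klmnopqr/Defghij/Defghij/Announcements", known))
print(first_divergence("/Defghij/Announcements", known))
```

The index returned identifies which parent's child listing to look for in the logs: in the first example the bad segment is at index 1, so the request of interest is the one that listed the children of "/Klmnopqr".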
Karl



On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Dmitry,

I don't understand what you mean by "I've tried the set of wildcards as below and I seem to be running into a lot of cycles, where various subsite folders are appended to each other and an extraction of data at all of those locations is attempted".   If you are seeing cycles it means that document discovery is still failing in some way.  For each folder/library/site/subsite, only the children of that folder/library/site/subsite should be appended to the path - ever.

If you can give a specific example, preferably including the soap back-and-forth, that would be very helpful.
Karl



On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
Hi Karl,

Quick question. Is there an easy way to configure an SP repo connection for crawling of all content, from the root site all the way down?

I've tried the set of wildcards as below and I seem to be running into a lot of cycles, where various subsite folders are appended to each other and an extraction of data at all of those locations is attempted. Ideally I'd like to avoid having to construct an exact set of paths because the set may change, especially with new content being added.

Path rules:
/* file include
/* library include
/* list include
/* site include

Metadata:
/* include true

I'd also like to pull down any files attached to list items. I'm hoping that some type of "/* file include" should do it, once I figure out how to safely include all content.

Thanks,
- Dmitry

--001a1133073a16941c04e6ae52ec--