manifoldcf-user mailing list archives

From Dmitry Goldenberg <dgoldenb...@kmwllc.com>
Subject Re: Getting a 401 Unauthorized on a SharePoint 2010 crawl request, with MCPermissions.asmx installed
Date Wed, 18 Sep 2013 15:16:02 GMT
SharePoint was not installed by a domain user (the Windows instance is not
on a domain).

This is not a canned AWS SharePoint installation; an engineer on the team
installed it, using the standard installer program, I believe.


On Wed, Sep 18, 2013 at 10:34 AM, Will Parkinson
<parkinson.will@gmail.com> wrote:

> Dmitry, do you know if SharePoint was installed by a domain user?  I have
> heard of issues with SharePoint if it was not installed using a domain user
> (e.g. DOMAIN\someuser).
>
>
> On Thu, Sep 19, 2013 at 12:31 AM, Will Parkinson <parkinson.will@gmail.com
> > wrote:
>
>> No, I didn't have that issue.  The issue I had was the // and ///
>> references being added in the wrong places in the page URLs.
>>
>> I was getting things like
>>
>>  /Site Name/Lib///rary/test.aspx
>>
>> My first setup was an out-of-the-box setup; the main site was on port
>> 80, using classic authentication.  With the path modification in the
>> mcf-sharepoint-connector.jar, it worked very well.
>>
>> I set up Active Directory on that same server to authenticate via NTLM.
>>
>> The second server had the site on https on port 443, and used claims-based
>> authentication via ADFS and Kerberos.  I had to modify the
>> mcf-sharepoint-connector.jar and MCPermissions.wsp to work around the
>> lack of SIDs returned from the permissions web service.
>>
>> In this case, Active Directory and ADFS were set up on separate AWS
>> servers
>>
>>
>>
>>
>> On Thu, Sep 19, 2013 at 12:23 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Will,
>>>
>>> The path stuff we're already dealing with - see the CONNECTORS-772
>>> branch.  But what we are having trouble with is something much more
>>> fundamental.  On Dmitry's AWS instance, when you talk to the web services
>>> for a root site, it works fine.  But as soon as you add a subsite path into
>>> the URL, it *seems* to work fine, but actually behaves as though you never
>>> specified any subsite at all - it returns root site information only.  On
>>> this system, this occurs for ALL web services, even Microsoft's.  The
>>> reason is that the value of SPContext.Current.Web never points to the
>>> subsite you specified.  The result is that you cannot use SharePoint
>>> subsites with ManifoldCF without causing havoc.
>>>
>>> Does this sound completely unfamiliar to you?  If you never encountered
>>> it, then we should compare how these instances were set up, unless you have
>>> any further ideas.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>> On Wed, Sep 18, 2013 at 10:12 AM, Will Parkinson <
>>> parkinson.will@gmail.com> wrote:
>>>
>>>> Hey Karl (and Dmitry)
>>>>
>>>> For AWS, I had to modify the way the relPath in the addFile
>>>> function in the FileStream class (in SharepointRepository.java) calculated
>>>> the modifiedPath.
>>>>
>>>> Essentially, I ensured that the relPath always contains the site as
>>>> part of the path:
>>>>
>>>>                 // Compare string contents, not references, in Java
>>>>                 if (siteName != null && !siteName.equals("")) {
>>>>                     int siteInd = relPath.indexOf(siteName);
>>>>                     if (siteInd == -1 || siteInd > 3) {
>>>>                         relPath = siteName + relPath;
>>>>                     }
>>>>                 }
>>>>
>>>>
>>>> That fixed my pathing issue and the index-out-of-bounds errors.
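[Editor's note: a minimal standalone sketch of the prefixing logic above. The method name and the sample paths are hypothetical; in the connector, siteName and relPath come from the crawl context.]

```java
public class RelPathFix {
    // Prefix relPath with the site name unless it is already present
    // near the start of the path (same check as in the thread).
    static String ensureSitePrefix(String siteName, String relPath) {
        if (siteName != null && !siteName.equals("")) {
            int siteInd = relPath.indexOf(siteName);
            if (siteInd == -1 || siteInd > 3) {
                relPath = siteName + relPath;
            }
        }
        return relPath;
    }

    public static void main(String[] args) {
        // Missing site prefix gets added:
        System.out.println(ensureSitePrefix("/SubSite", "/Shared Documents/test.aspx"));
        // Already-prefixed path is left alone:
        System.out.println(ensureSitePrefix("/SubSite", "/SubSite/Shared Documents/test.aspx"));
    }
}
```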
>>>>
>>>> I have also made many other modifications to cope with AD and claims-based
>>>> auth, and for compatibility with SharePoint 2013.
>>>>
>>>> Dmitry, I have uploaded my modified mcf-sharepoint-connector.jar and
>>>> MCPermissions WSP if you would like to try them out:
>>>>
>>>> http://pngnetworks.com/sharepoint-2010-claims.zip
>>>>
>>>> Just make sure you back up your current ones as this is still very much
>>>> in development :)
>>>>
>>>> Also, the logging is very verbose.
>>>>
>>>> Cheers,
>>>>
>>>> Will
>>>>
>>>>
>>>> On Wed, Sep 18, 2013 at 11:41 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Will,
>>>>> When you folks set up YOUR AWS instance, did it work with MCF out of
>>>>> the box?  Or did you need to do something?  And, if so, what did you do?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 18, 2013 at 9:28 AM, Will Parkinson <
>>>>> parkinson.will@gmail.com> wrote:
>>>>>
>>>>>> Yes, that's right; we're only really interested in the site that you are
>>>>>> trying to crawl.
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 18, 2013 at 11:25 PM, Dmitry Goldenberg <
>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>
>>>>>>> Will,
>>>>>>>
>>>>>>> For SharePoint - 80, the output is
>>>>>>>
>>>>>>> NTAuthenticationProviders       : (STRING) "NTLM"
>>>>>>>
>>>>>>> I assume we're not interested in the Default Web Site; for that, the
>>>>>>> output is simply "The parameter NTAuthenticationProviders is not set at
>>>>>>> this node."
>>>>>>>
>>>>>>> - Dmitry
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 18, 2013 at 9:16 AM, Will Parkinson <
>>>>>>> parkinson.will@gmail.com> wrote:
>>>>>>>
>>>>>>>> If you open IIS manager and click on sites, it is displayed in the
>>>>>>>> ID column (see screenshot attached)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 18, 2013 at 10:55 PM, Dmitry Goldenberg <
>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Will,
>>>>>>>>> Sorry, what is the "sharepoint website *number*" in that
>>>>>>>>> invocation?
>>>>>>>>> - Dmitry
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 18, 2013 at 8:53 AM, Will Parkinson <
>>>>>>>>> parkinson.will@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dmitry
>>>>>>>>>>
>>>>>>>>>> Just out of interest, what does the following command output on
>>>>>>>>>> your system?
>>>>>>>>>>
>>>>>>>>>> cd to C:\inetpub\adminscripts
>>>>>>>>>>
>>>>>>>>>> cscript adsutil.vbs get w3svc/<put your sharepoint website
>>>>>>>>>> number here>/root/NTAuthenticationProviders
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Will
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 18, 2013 at 10:44 PM, Karl Wright <daddywri@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> "This is the second time I'm encountering the issue which leads
>>>>>>>>>>> me to believe it's a quirk of IIS and/or SharePoint."
>>>>>>>>>>>
>>>>>>>>>>> It cannot be just a quirk of SharePoint because SharePoint's UI
>>>>>>>>>>> etc could not create or work with subsites if that was true.  It may well
>>>>>>>>>>> be a configuration issue with IIS, which is indeed what I suspect.  I have
>>>>>>>>>>> pinged all the resources I know of to try and get some insight as to why
>>>>>>>>>>> this is happening.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> "Perhaps this is something that can be worked into the 'fabric'
>>>>>>>>>>> of ManifoldCF as a workaround for a known issue."
>>>>>>>>>>>
>>>>>>>>>>> Like I said before, this is a huge amount of work, tantamount to
>>>>>>>>>>> rewriting most of the connector.  If this is what you want to request, that
>>>>>>>>>>> is your option, but there is no way we'd complete any of this work before
>>>>>>>>>>> December/January at the earliest.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> "Just to understand this a bit better, the main breakage here is
>>>>>>>>>>> that the wildcards don't work properly, right? "
>>>>>>>>>>>
>>>>>>>>>>> No, it means that ManifoldCF cannot get at any data of any kind
>>>>>>>>>>> associated with a SharePoint subsite.  Accessing root data works fine.  If
>>>>>>>>>>> you try to crawl as things are now, you must disable all subsites and just
>>>>>>>>>>> crawl the root site, or you will crawl the same things with longer and
>>>>>>>>>>> longer paths indefinitely.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 18, 2013 at 8:38 AM, Dmitry Goldenberg <
>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Karl,
>>>>>>>>>>>>
>>>>>>>>>>>> This is the second time I'm encountering the issue, which leads
>>>>>>>>>>>> me to believe it's a quirk of IIS and/or SharePoint. Perhaps this is
>>>>>>>>>>>> something that can be worked into the 'fabric' of ManifoldCF as a
>>>>>>>>>>>> workaround for a known issue. I understand that it may have far-reaching
>>>>>>>>>>>> tentacles, but I wonder if that's really the only option...
>>>>>>>>>>>>
>>>>>>>>>>>> Just to understand this a bit better, the main breakage here is
>>>>>>>>>>>> that the wildcards don't work properly, right?  In theory if I have a repo
>>>>>>>>>>>> connector config which lists specific library and list paths, things should
>>>>>>>>>>>> work?  It's only when the /* types of wildcards are included that we're in
>>>>>>>>>>>> trouble?
>>>>>>>>>>>>
>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 18, 2013 at 8:07 AM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Someone else was having a similar problem. See
>>>>>>>>>>>>> http://social.technet.microsoft.com/Forums/sharepoint/en-US/e4b53c63-b89a-4356-a7b0-6ca7bfd22826/getting-sharepoint-subsite-from-custom-webservice.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Apparently it does depend on how you get to the web service,
>>>>>>>>>>>>> which does argue that it is an IIS issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 17, 2013 at 5:44 PM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As discussed privately I had a look at your system.  What is
>>>>>>>>>>>>>> happening is that the C# static SPContext.Current.Web is not reflecting the
>>>>>>>>>>>>>> subsite in any url that contains a subsite.  In other words, the URL coming
>>>>>>>>>>>>>> in might be "
>>>>>>>>>>>>>> http://servername/subsite1/_vti_bin/MCPermissions.asmx", but
>>>>>>>>>>>>>> the MCPermissions.asmx plugin will think it is being executed in the root
>>>>>>>>>>>>>> context ("http://servername").  That's pretty broken
>>>>>>>>>>>>>> behavior, so I'm guessing that either IIS or SharePoint
>>>>>>>>>>>>>> is somehow misconfigured to do this; if that were corrected, the web
>>>>>>>>>>>>>> services would begin to work right again.  But I have no idea how this
>>>>>>>>>>>>>> should actually be fixed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Will Parkinson, one of the subscribers of this list, may find
>>>>>>>>>>>>>> the symptoms meaningful, since he set up an AWS SharePoint instance
>>>>>>>>>>>>>> before.  I hope he will respond in a helpful way.  Until then, I think we
>>>>>>>>>>>>>> are stuck.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 17, 2013 at 9:49 AM, Dmitry Goldenberg <
>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like I'll be able to get access for you to the test
>>>>>>>>>>>>>>> system we're using. Would you be interested in working with the system
>>>>>>>>>>>>>>> directly? I certainly don't mind doing some testing but I thought we'd
>>>>>>>>>>>>>>> speed things up this way. If so, could you email me from a more private
>>>>>>>>>>>>>>> account so we can set this up?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Sep 17, 2013 at 7:38 AM, Karl Wright <
>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another interesting bit from the log:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/_catalogs/lt/Forms/AllItems.aspx', 'List
>>>>>>>>>>>>>>>> Template Gallery'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/_catalogs/masterpage/Forms/AllItems.aspx',
>>>>>>>>>>>>>>>> 'Master Page Gallery'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/Shared Documents/Forms/AllItems.aspx', 'Shared
>>>>>>>>>>>>>>>> Documents'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/SiteAssets/Forms/AllItems.aspx', 'Site Assets'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/SitePages/Forms/AllPages.aspx', 'Site Pages'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/_catalogs/solutions/Forms/AllItems.aspx',
>>>>>>>>>>>>>>>> 'Solution Gallery'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/Style Library/Forms/AllItems.aspx', 'Style
>>>>>>>>>>>>>>>> Library'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/Test Library 1/Forms/AllItems.aspx', 'Test
>>>>>>>>>>>>>>>> Library 1'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/_catalogs/theme/Forms/AllItems.aspx', 'Theme
>>>>>>>>>>>>>>>> Gallery'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library list: '/_catalogs/wp/Forms/AllItems.aspx', 'Web Part
>>>>>>>>>>>>>>>> Gallery'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Checking whether to include library
>>>>>>>>>>>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared
>>>>>>>>>>>>>>>> Documents' exactly matched rule path '/*'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Including library
>>>>>>>>>>>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Checking whether to include library
>>>>>>>>>>>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>>>>>>>>>>>> exactly matched rule path '/*'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Including library
>>>>>>>>>>>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Checking whether to include library
>>>>>>>>>>>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>>>>>>>>>>>> exactly matched rule path '/*'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Including library
>>>>>>>>>>>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Checking whether to include library
>>>>>>>>>>>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>>>>>>>>>>>> exactly matched rule path '/*'
>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') -
>>>>>>>>>>>>>>>> SharePoint: Including library
>>>>>>>>>>>>>>>> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
>>>>>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This time it appears that it is the Lists service that is
>>>>>>>>>>>>>>>> broken and does not recognize the parent site.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I haven't corrected this problem yet since now I am
>>>>>>>>>>>>>>>> beginning to wonder if *any* of the web services under Amazon work at all
>>>>>>>>>>>>>>>> for subsites.  We may be better off implementing everything we need in the
>>>>>>>>>>>>>>>> MCPermissions service.  I will ponder this as I continue to research the
>>>>>>>>>>>>>>>> logs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's still valuable to check my getSites() implementation.
>>>>>>>>>>>>>>>> I'll be doing another round of work tonight on the plugin.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 8:45 PM, Karl Wright <
>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The augmented plugin can be downloaded from
>>>>>>>>>>>>>>>>> http://people.apache.org/~kwright/MetaCarta.SharePoint.MCPermissionsService.wsp.  The revised connector code is also ready, and should be checked out and
>>>>>>>>>>>>>>>>> built from
>>>>>>>>>>>>>>>>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-772.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Once you set it all up, you can see if it is doing the
>>>>>>>>>>>>>>>>> right thing by just trying to drill down through subsites in the UI.  You
>>>>>>>>>>>>>>>>> should always see a list of subsites that is appropriate for the context
>>>>>>>>>>>>>>>>> you are in; if this does not happen it is not working.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:45 PM, Dmitry Goldenberg <
>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can see how preloading the list of subsites may be less
>>>>>>>>>>>>>>>>>> optimal. The advantage of doing it this way is one call and you've got the
>>>>>>>>>>>>>>>>>> structure in memory, which may be OK unless there are sites with a ton of
>>>>>>>>>>>>>>>>>> subsites, which may stress memory. The disadvantage is having to pass
>>>>>>>>>>>>>>>>>> this structure around.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, I'll certainly help test out your changes, just let
>>>>>>>>>>>>>>>>>> me know when they're available.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:19 PM, Karl Wright <
>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the code snippet.  I'd prefer, though, to not
>>>>>>>>>>>>>>>>>>> preload the entire site structure in memory.  Probably it would be better
>>>>>>>>>>>>>>>>>>> to just add another method to the ManifoldCF SharePoint 2010 plugin.  More
>>>>>>>>>>>>>>>>>>> methods are going to be added anyway to support Claim Space Authentication,
>>>>>>>>>>>>>>>>>>> so I guess this would be just one more.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We honestly have never seen this problem before - so
>>>>>>>>>>>>>>>>>>> it's not just flakiness, it has something to do with the installation, I'm
>>>>>>>>>>>>>>>>>>> certain.  At any rate, I'll get going right away on a workaround - if you
>>>>>>>>>>>>>>>>>>> are willing to test what I produce.  I'm also certain there is at least one
>>>>>>>>>>>>>>>>>>> other issue, but hopefully that will become clearer once this one is
>>>>>>>>>>>>>>>>>>> resolved.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 6:49 PM, Dmitry Goldenberg <
>>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> >> subsite discovery is effectively disabled except
>>>>>>>>>>>>>>>>>>>> directly under the root site
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes. Come to think of it, I once came across this
>>>>>>>>>>>>>>>>>>>> problem while implementing a SharePoint connector.  I'm not sure whether
>>>>>>>>>>>>>>>>>>>> it's exactly what's happening with the issue we're discussing but looks
>>>>>>>>>>>>>>>>>>>> like it.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I started off by using multiple getWebCollection calls
>>>>>>>>>>>>>>>>>>>> to get child subsites of sites and trying to navigate down that way. The
>>>>>>>>>>>>>>>>>>>> problem was that getWebCollection was always returning the immediate
>>>>>>>>>>>>>>>>>>>> subsites of the root site no matter whether you're at the root or below, so
>>>>>>>>>>>>>>>>>>>> I ended up generating infinite loops.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I switched over to using a single
>>>>>>>>>>>>>>>>>>>> getAllSubWebCollection call and caching its results. That call returns the
>>>>>>>>>>>>>>>>>>>> full list of all subsites as pairs of Title and Url.  I had a POJO similar
>>>>>>>>>>>>>>>>>>>> to the one below which held the list of sites and contained logic for
>>>>>>>>>>>>>>>>>>>> enumerating the child sites, given the URL of a (parent) site.  From what I
>>>>>>>>>>>>>>>>>>>> recall, getWebCollection works inconsistently, either across SP versions or
>>>>>>>>>>>>>>>>>>>> across installations, but the logic below should work in any case.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *** public class SubSiteCollection -- holds a list of
>>>>>>>>>>>>>>>>>>>> CrawledSite POJOs, each of which is a { title, url }.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *** SubSiteCollection has the following:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  public List<CrawledSite> getImmediateSubSites(String siteUrl) {
>>>>>>>>>>>>>>>>>>>>    List<CrawledSite> subSites = new ArrayList<CrawledSite>();
>>>>>>>>>>>>>>>>>>>>    for (CrawledSite site : sites) {
>>>>>>>>>>>>>>>>>>>>      if (isChildOf(siteUrl, site.getUrl().toString())) {
>>>>>>>>>>>>>>>>>>>>        subSites.add(site);
>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>>>>>>>    return subSites;
>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  private static boolean isChildOf(String parentUrl, String urlToCheck) {
>>>>>>>>>>>>>>>>>>>>    final String parent = normalizeUrl(parentUrl);
>>>>>>>>>>>>>>>>>>>>    final String child = normalizeUrl(urlToCheck);
>>>>>>>>>>>>>>>>>>>>    boolean ret = false;
>>>>>>>>>>>>>>>>>>>>    if (child.startsWith(parent)) {
>>>>>>>>>>>>>>>>>>>>      String remainder = child.substring(parent.length());
>>>>>>>>>>>>>>>>>>>>      ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
>>>>>>>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>>>>>>>    return ret;
>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>  private static String normalizeUrl(String url) {
>>>>>>>>>>>>>>>>>>>>    return ((url.endsWith(SLASH)) ? url : url + SLASH).toLowerCase();
>>>>>>>>>>>>>>>>>>>>  }
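[Editor's note: for clarity, here is a self-contained sketch of the same immediate-child check. The class name and sample URLs are illustrative, and a plain counting loop stands in for Spring's StringUtils.countOccurrencesOf:]

```java
public class SubSiteCheck {
    private static final String SLASH = "/";

    // Count non-overlapping occurrences of sub within s.
    static int countOccurrences(String s, String sub) {
        int count = 0, idx = 0;
        while ((idx = s.indexOf(sub, idx)) != -1) {
            count++;
            idx += sub.length();
        }
        return count;
    }

    // Ensure a trailing slash and lowercase, so comparisons are uniform.
    static String normalizeUrl(String url) {
        return (url.endsWith(SLASH) ? url : url + SLASH).toLowerCase();
    }

    // True only when urlToCheck is an *immediate* child of parentUrl:
    // the remainder after the parent prefix contains exactly one slash.
    static boolean isChildOf(String parentUrl, String urlToCheck) {
        String parent = normalizeUrl(parentUrl);
        String child = normalizeUrl(urlToCheck);
        if (!child.startsWith(parent))
            return false;
        String remainder = child.substring(parent.length());
        return countOccurrences(remainder, SLASH) == 1;
    }

    public static void main(String[] args) {
        String root = "http://server/Site";
        System.out.println(isChildOf(root, "http://server/Site/Sub"));        // true: immediate child
        System.out.println(isChildOf(root, "http://server/Site/Sub/Deeper")); // false: grandchild
        System.out.println(isChildOf(root, "http://server/Other"));           // false: unrelated
    }
}
```

Because grandchildren leave more than one slash in the remainder, a single cached GetAllSubWebCollection result can be filtered per level without re-calling the web service.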
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Have a look at this sequence also:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Subsite list: '
>>>>>>>>>>>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd',
>>>>>>>>>>>>>>>>>>>>> 'Abcd'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Subsite list: '
>>>>>>>>>>>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij',
>>>>>>>>>>>>>>>>>>>>> 'Defghij'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Subsite list: '
>>>>>>>>>>>>>>>>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr',
>>>>>>>>>>>>>>>>>>>>> 'Klmnopqr'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Checking whether to include site
>>>>>>>>>>>>>>>>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched rule
>>>>>>>>>>>>>>>>>>>>> path '/*'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Checking whether to include site
>>>>>>>>>>>>>>>>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly matched
>>>>>>>>>>>>>>>>>>>>> rule path '/*'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Checking whether to include site
>>>>>>>>>>>>>>>>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly matched
>>>>>>>>>>>>>>>>>>>>> rule path '/*'
>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') -
>>>>>>>>>>>>>>>>>>>>> SharePoint: Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is using the GetSites(String parent) method with
>>>>>>>>>>>>>>>>>>>>> a site name of "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back three sites
>>>>>>>>>>>>>>>>>>>>> (!!).  The parent path is not correct, obviously, but nevertheless this is
>>>>>>>>>>>>>>>>>>>>> one way in which paths are getting completely messed up.  It *looks* like the
>>>>>>>>>>>>>>>>>>>>> Webs web service is broken in such a way as to ignore the URL coming in,
>>>>>>>>>>>>>>>>>>>>> except for the base part, which means that subsite discovery is effectively
>>>>>>>>>>>>>>>>>>>>> disabled except directly under the root site.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This might still be OK if it is not possible to create
>>>>>>>>>>>>>>>>>>>>> subsites of subsites in this version of SharePoint.  Can you confirm that
>>>>>>>>>>>>>>>>>>>>> this is or is not possible?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 2:42 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "This is everything that got generated, from the very
>>>>>>>>>>>>>>>>>>>>>> beginning"
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Well, something isn't right.  What I expect to see
>>>>>>>>>>>>>>>>>>>>>> that I don't right up front are:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> - A webs "getWebCollection" invocation for
>>>>>>>>>>>>>>>>>>>>>> /_vti_bin/webs.asmx
>>>>>>>>>>>>>>>>>>>>>> - Two lists "getListCollection" invocations for
>>>>>>>>>>>>>>>>>>>>>> /_vti_bin/lists.asmx
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Instead the first transactions I see are from already
>>>>>>>>>>>>>>>>>>>>>> busted URLs - which make no sense since there would be no way they should
>>>>>>>>>>>>>>>>>>>>>> have been able to get queued yet.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> So there are a number of possibilities.  First, maybe
>>>>>>>>>>>>>>>>>>>>>> the log isn't getting cleared out, and the session in question therefore
>>>>>>>>>>>>>>>>>>>>>> starts somewhere in the middle of manifoldcf.log.1.  But no:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>>> C:\logs>grep "POST /_vti_bin/webs" manifoldcf.log.1
>>>>>>>>>>>>>>>>>>>>>> grep: input lines truncated - result questionable
>>>>>>>>>>>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Nevertheless there are some interesting points here.
>>>>>>>>>>>>>>>>>>>>>> First, note the following response, which I've been able to determine is
>>>>>>>>>>>>>>>>>>>>>> against "Test Library 1":
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>>>>>>>>>>>> SharePoint: getListItems xml response: '<GetListItems xmlns="
>>>>>>>>>>>>>>>>>>>>>> http://schemas.microsoft.com/sharepoint/soap/directory/"><GetListItemsResponse
>>>>>>>>>>>>>>>>>>>>>> xmlns=""><GetListItemsResult
>>>>>>>>>>>>>>>>>>>>>> FileRef="SitePages/Home.aspx"/></GetListItemsResponse></GetListItems>'
>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>>>>>>>>>>>> SharePoint: Checking whether to include document '/SitePages/Home.aspx'
>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>>>>>>>>>>>> SharePoint: File '/SitePages/Home.aspx' exactly matched rule path '/*'
>>>>>>>>>>>>>>>>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>>>>>>>>>>>> SharePoint: Including file '/SitePages/Home.aspx'
>>>>>>>>>>>>>>>>>>>>>>  WARN 2013-09-16 13:02:31,590 (Worker thread '23') -
>>>>>>>>>>>>>>>>>>>>>> Sharepoint: Unexpected relPath structure; path is '/SitePages/Home.aspx',
>>>>>>>>>>>>>>>>>>>>>> but expected <list/library> length of 26
>>>>>>>>>>>>>>>>>>>>>> <<<<<<
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The FileRef in this case is pointing at what,
>>>>>>>>>>>>>>>>>>>>>> exactly?  Is there a SitePages/Home.aspx in the "Test Library 1" library?
>>>>>>>>>>>>>>>>>>>>>> Or does it mean to refer back to the root site with this URL construction?
>>>>>>>>>>>>>>>>>>>>>> And since this is supposedly at the root level, how come the combined site
>>>>>>>>>>>>>>>>>>>>>> + library name comes out to 26??  I get 15, which leaves 11 characters
>>>>>>>>>>>>>>>>>>>>>> unaccounted for.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm still looking at the logs to see if I can glean
>>>>>>>>>>>>>>>>>>>>>> key information.  Later, if I could set up a crawl against the sharepoint
>>>>>>>>>>>>>>>>>>>>>> instance in question, that would certainly help.  I can readily set up an
>>>>>>>>>>>>>>>>>>>>>> ssh tunnel if that is what is required.  But I won't be able to do it until
>>>>>>>>>>>>>>>>>>>>>> I get home tonight.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:58 PM, Dmitry Goldenberg <
>>>>>>>>>>>>>>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> This is everything that got generated, from the very
>>>>>>>>>>>>>>>>>>>>>>> beginning, meaning that I did a fresh build, new database, new connection
>>>>>>>>>>>>>>>>>>>>>>> definitions, start. The log must have rolled but the .1 log is included.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If I were to get you access to the actual test
>>>>>>>>>>>>>>>>>>>>>>> system, would you mind taking a look? It may be more efficient than sending
>>>>>>>>>>>>>>>>>>>>>>> logs..
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> These logs are different but have exactly the same
>>>>>>>>>>>>>>>>>>>>>>>> problem; they start in the middle when the crawl is already well underway.
>>>>>>>>>>>>>>>>>>>>>>>> I'm wondering if by chance you have more than one agents process running or
>>>>>>>>>>>>>>>>>>>>>>>> something?  Or maybe the log is rolling and stuff is getting lost?  What's
>>>>>>>>>>>>>>>>>>>>>>>> there is not what I would expect to see, at all.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I *did* manage to find two transactions that look
>>>>>>>>>>>>>>>>>>>>>>>> like they might be helpful, but because the *results* of those transactions
>>>>>>>>>>>>>>>>>>>>>>>> are required by transactions that take place minutes *before* in the log, I
>>>>>>>>>>>>>>>>>>>>>>>> have no confidence that I'm looking at anything meaningful.  But I'll get
>>>>>>>>>>>>>>>>>>>>>>>> back to you on what I find nonetheless.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> If you decide to repeat this exercise, try watching
>>>>>>>>>>>>>>>>>>>>>>>> the log with "tail -f" before starting the job.  You should not see any log
>>>>>>>>>>>>>>>>>>>>>>>> contents at all until the job is started.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg
>>>>>>>>>>>>>>>>>>>>>>>> <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Attached please find logs which start at the
>>>>>>>>>>>>>>>>>>>>>>>>> beginning. I started from a fresh build (clean db etc.), the logs start at
>>>>>>>>>>>>>>>>>>>>>>>>> server start, then I create the output connection and the repo connection,
>>>>>>>>>>>>>>>>>>>>>>>>> then the job, and then I fire off the job. I aborted the execution about a
>>>>>>>>>>>>>>>>>>>>>>>>> minute into it or so.  That's all that's in the logs with:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.connectors=DEBUG
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> log4j.logger.httpclient.wire.header=DEBUG
>>>>>>>>>>>>>>>>>>>>>>>>> log4j.logger.org.apache.commons.httpclient=DEBUG
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>
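[For readers reconstructing this logging setup: in a stock ManifoldCF example deployment, the first setting above is a ManifoldCF property that belongs in properties.xml, while the log4j.logger.* lines belong in the log4j configuration file that properties.xml points to. The file name logging.ini and the exact property layout below are assumptions based on the standard distribution; verify against your own properties.xml. A minimal sketch:]

```
# properties.xml (XML format) carries ManifoldCF's own logging switches, e.g.:
#   <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
#   <property name="org.apache.manifoldcf.logconfigfile" value="./logging.ini"/>
#
# logging.ini (log4j properties format) carries the wire-level HTTP tracing:
log4j.logger.httpclient.wire.header=DEBUG
log4j.logger.org.apache.commons.httpclient=DEBUG
```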
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Are you sure these are the right logs?
>>>>>>>>>>>>>>>>>>>>>>>>>> - They start right in the middle of a crawl
>>>>>>>>>>>>>>>>>>>>>>>>>> - They are already in a broken state when they
>>>>>>>>>>>>>>>>>>>>>>>>>> start, e.g. the kinds of things that are being looked up are already
>>>>>>>>>>>>>>>>>>>>>>>>>> nonsense paths
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I need to see logs from the BEGINNING of a fresh
>>>>>>>>>>>>>>>>>>>>>>>>>> crawl to see how the nonsense paths happen.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:52 AM, Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>> Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I've generated logs with details as we discussed.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> The job was created afresh, as before:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Path rules:
>>>>>>>>>>>>>>>>>>>>>>>>>>> /* file include
>>>>>>>>>>>>>>>>>>>>>>>>>>> /* library include
>>>>>>>>>>>>>>>>>>>>>>>>>>> /* list include
>>>>>>>>>>>>>>>>>>>>>>>>>>> /* site include
>>>>>>>>>>>>>>>>>>>>>>>>>>> Metadata:
>>>>>>>>>>>>>>>>>>>>>>>>>>> /* include true
>>>>>>>>>>>>>>>>>>>>>>>>>>> The logs are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> "Do you think that this issue is generic with
>>>>>>>>>>>>>>>>>>>>>>>>>>>> regard to any Amz instance?"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I presume so, since you didn't apparently do
>>>>>>>>>>>>>>>>>>>>>>>>>>>> anything special to set one of these up.  Unfortunately, such instances are
>>>>>>>>>>>>>>>>>>>>>>>>>>>> not part of the free tier, so I am still constrained from setting one up
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for myself because of household rules here.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> "For now, I assume our only workaround is to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> list the paths of interest manually"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Depending on what is going wrong, that may not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> even work.  For this to happen, it looks like several SharePoint web
>>>>>>>>>>>>>>>>>>>>>>>>>>>> service calls may be affected, and not in a cleanly predictable way.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> "is identification and extraction of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> attachments supported in the SP connector?"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF in general leaves identification and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> extraction to the search engine.  Solr, for instance, uses Tika for this, if
>>>>>>>>>>>>>>>>>>>>>>>>>>>> so configured.  You can configure your Solr output connection to include or
>>>>>>>>>>>>>>>>>>>>>>>>>>>> exclude specific mime types or extensions if you want to limit what is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> attempted.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Karl. Do you think that this issue is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generic with regard to any Amz instance? I'm just wondering how easily
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reproducible this may be..
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For now, I assume our only workaround is to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list the paths of interest manually, i.e. add explicit rules for each
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> library and list.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A related subject - is identification and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> extraction of attachments supported in the SP connector?  E.g. if I have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Word doc attached to a Task list item, would that be extracted?  So far, I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> see that library content gets crawled and I'm getting the list item data
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but am not sure what happens to the attachments.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the additional information.  It
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does appear like the method that lists subsites is not working as expected
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> under AWS.  Nor are some number of other methods which supposedly just list
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the children of a subsite.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've reopened CONNECTORS-772 to work on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addressing this issue.  Please stay tuned.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Most of the paths that get generated are
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> listed in the attached log; they match what shows up in the diag report. So
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm not sure where they diverge; most of them just don't seem right.  There
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are 3 subsites rooted in the main site: Abcd, Defghij, Klmnopqr.  It's
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strange that the connector would try such paths as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements///
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- there are multiple repetitions of the same subsite on the path and to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> begin with, Defghij is not a subsite of Klmnopqr, so why would it try
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this?  The /// at the end doesn't seem correct either, unless I'm missing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> something in how this pathing works.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /Test Library
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks wrong. A
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> docname is mixed into the path, a subsite ends up after a docname?...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /Shared
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issues plus now somehow the docname got split with a forward slash?..
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There are also a bunch of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> StringIndexOutOfBoundsException's.  Perhaps this logic doesn't fit with the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pathing we're seeing on this amz-based installation?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd expect the logic to just know that root
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> contains 3 subsites, and work off that. Each subsite has a specific list of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> libraries and lists, etc. It seems odd that the connector gets into this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> matching pattern, and tries what looks like thousands of variations (I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> aborted the execution).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To clarify, the way you would need to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> analyze this is to run a crawl with the wildcards as you have selected,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> abort if necessary after a while, and then use the Document Status report
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to list the document identifiers that had been generated.  Find a document
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> identifier that you believe represents a path that is illegal, and figure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> out what SOAP getChild call caused the problem by returning incorrect
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> data.  In other words, find the point in the path where the path diverges
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from what exists into what doesn't exist, and go back in the ManifoldCF
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> logs to find the particular SOAP request that led to the issue.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd expect from your description that the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> problem lies with getting child sites given a site path, but that's just a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> guess at this point.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't understand what you mean by "I've
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tried the set of wildcards as below and I seem to be running into a lot of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cycles, where various subsite folders are appended to each other and an
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> extraction of data at all of those locations is attempted".   If you are
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seeing cycles it means that document discovery is still failing in some
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way.  For each folder/library/site/subsite, only the children of that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> folder/library/site/subsite should be appended to the path - ever.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you can give a specific example,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> preferably including the soap back-and-forth, that would be very helpful.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Quick question. Is there an easy way to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure an SP repo connection for crawling of all content, from the root
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> site all the way down?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've tried the set of wildcards as below
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and I seem to be running into a lot of cycles, where various subsite
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> folders are appended to each other and an extraction of data at all of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> those locations is attempted. Ideally I'd like to avoid having to construct
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> an exact set of paths because the set may change, especially with new
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> content being added.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Path rules:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* file include
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* library include
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* list include
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* site include
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Metadata:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* include true
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd also like to pull down any files
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> attached to list items. I'm hoping that some type of "/* file include"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should do it, once I figure out how to safely include all content.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
