manifoldcf-user mailing list archives

From Dmitry Goldenberg <dgoldenb...@kmwllc.com>
Subject Re: Getting a 401 Unauthorized on a SharePoint 2010 crawl request, with MCPermissions.asmx installed
Date Tue, 17 Sep 2013 13:49:50 GMT
Hi Karl,

It looks like I'll be able to get you access to the test system we're
using. Would you be interested in working with the system directly? I
certainly don't mind doing some testing, but I thought we'd speed things up
this way. If so, could you email me from a more private account so we can
set this up?

Thanks,
- Dmitry



On Tue, Sep 17, 2013 at 7:38 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Dmitry,
>
> Another interesting bit from the log:
>
> >>>>>>
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/_catalogs/lt/Forms/AllItems.aspx', 'List Template Gallery'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/_catalogs/masterpage/Forms/AllItems.aspx', 'Master Page Gallery'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/Shared Documents/Forms/AllItems.aspx', 'Shared Documents'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/SiteAssets/Forms/AllItems.aspx', 'Site Assets'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/SitePages/Forms/AllPages.aspx', 'Site Pages'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/_catalogs/solutions/Forms/AllItems.aspx', 'Solution Gallery'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/Style Library/Forms/AllItems.aspx', 'Style Library'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/Test Library 1/Forms/AllItems.aspx', 'Test Library 1'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/_catalogs/theme/Forms/AllItems.aspx', 'Theme Gallery'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> list: '/_catalogs/wp/Forms/AllItems.aspx', 'Web Part Gallery'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking
> whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared
> Documents'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents' exactly matched
> rule path '/*'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including
> library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking
> whether to include library
> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets' exactly matched rule
> path '/*'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including
> library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking
> whether to include library
> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages' exactly matched rule
> path '/*'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including
> library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking
> whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style
> Library'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
> '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library' exactly matched
> rule path '/*'
> DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including
> library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
> <<<<<<
>
> This time it appears that it is the Lists service that is broken and does
> not recognize the parent site.
>
> I haven't corrected this problem yet since now I am beginning to wonder if
> *any* of the web services under Amazon work at all for subsites.  We may be
> better off implementing everything we need in the MCPermissions service.  I
> will ponder this as I continue to research the logs.
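>
> If it helps to rule the connector out entirely, the Webs and Lists services
> can be hit directly with a hand-built SOAP request.  This is just an
> illustrative sketch, not connector code - the host, subsite path, and
> credentials below are placeholders, and it uses plain Basic auth where your
> installation may actually require NTLM or claims:
>
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.io.OutputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
> import java.util.Base64;
>
> // Posts a GetWebCollection request to the Webs service of a *subsite* and
> // dumps the raw response.  If the installation behaves the way the logs
> // suggest, the <Web> entries that come back will be the root site's
> // children rather than the subsite's.
> public class WebsCheck {
>   public static void main(String[] args) throws Exception {
>     String subsite = "http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd"; // placeholder
>     String soap =
>         "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
>         + "<soap:Body>"
>         + "<GetWebCollection xmlns=\"http://schemas.microsoft.com/sharepoint/soap/\"/>"
>         + "</soap:Body></soap:Envelope>";
>     HttpURLConnection conn =
>         (HttpURLConnection) new URL(subsite + "/_vti_bin/webs.asmx").openConnection();
>     conn.setRequestMethod("POST");
>     conn.setDoOutput(true);
>     conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
>     conn.setRequestProperty("SOAPAction",
>         "\"http://schemas.microsoft.com/sharepoint/soap/GetWebCollection\"");
>     conn.setRequestProperty("Authorization", "Basic " + Base64.getEncoder()
>         .encodeToString("DOMAIN\\user:password".getBytes("UTF-8"))); // placeholder creds
>     try (OutputStream out = conn.getOutputStream()) {
>       out.write(soap.getBytes("UTF-8"));
>     }
>     // A 401 here would point at authentication rather than the Webs service itself.
>     BufferedReader in = new BufferedReader(
>         new InputStreamReader(conn.getInputStream(), "UTF-8"));
>     for (String line; (line = in.readLine()) != null; ) {
>       System.out.println(line);
>     }
>   }
> }
>
> The same shape of request against <subsite>/_vti_bin/lists.asmx with a
> GetListCollection body (same namespace) would show whether the Lists service
> honors the subsite URL either.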
>
> It's still valuable to check my getSites() implementation.  I'll be doing
> another round of work tonight on the plugin.
>
>
> Karl
>
>
> On Mon, Sep 16, 2013 at 8:45 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> The augmented plugin can be downloaded from
>> http://people.apache.org/~kwright/MetaCarta.SharePoint.MCPermissionsService.wsp.
>> The revised connector code is also ready, and should be checked out and
>> built from
>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-772 .
>>
>> Once you set it all up, you can see if it is doing the right thing by
>> just trying to drill down through subsites in the UI.  You should always
>> see a list of subsites that is appropriate for the context you are in; if
>> this does not happen it is not working.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Mon, Sep 16, 2013 at 7:45 PM, Dmitry Goldenberg <
>> dgoldenberg@kmwllc.com> wrote:
>>
>>> Karl,
>>>
>>> I can see how preloading the list of subsites may be less optimal.. The
>>> advantage of doing it this way is one call and you've got the structure in
>>> memory, which may be OK unless there are sites with a ton of subsites which
>>> may stress out memory. The disadvantage is having to throw this structure
>>> around..
>>>
>>> Yes, I'll certainly help test out your changes, just let me know when
>>> they're available.
>>>
>>> Thanks,
>>> - Dmitry
>>>
>>>
>>> On Mon, Sep 16, 2013 at 7:19 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Dmitry,
>>>>
>>>> Thanks for the code snippet.  I'd prefer, though, to not preload the
>>>> entire site structure in memory.  Probably it would be better to just add
>>>> another method to the ManifoldCF SharePoint 2010 plugin.  More methods are
>>>> going to be added anyway to support Claim Space Authentication, so I guess
>>>> this would be just one more.
>>>>
>>>> We honestly have never seen this problem before - so it's not just
>>>> flakiness, it has something to do with the installation, I'm certain.  At
>>>> any rate, I'll get going right away on a workaround - if you are willing to
>>>> test what I produce.  I'm also certain there is at least one other issue,
>>>> but hopefully that will become clearer once this one is resolved.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Sep 16, 2013 at 6:49 PM, Dmitry Goldenberg <
>>>> dgoldenberg@kmwllc.com> wrote:
>>>>
>>>>> Karl,
>>>>>
>>>>> >> subsite discovery is effectively disabled except directly under the root site
>>>>>
>>>>> Yes. Come to think of it, I once came across this problem while
>>>>> implementing a SharePoint connector.  I'm not sure whether it's exactly
>>>>> what's happening with the issue we're discussing but looks like it.
>>>>>
>>>>> I started off by using multiple getWebCollection calls to get child
>>>>> subsites of sites and trying to navigate down that way. The problem was
>>>>> that getWebCollection was always returning the immediate subsites of the
>>>>> root site no matter whether you're at the root or below, so I ended up
>>>>> generating infinite loops.
>>>>>
>>>>> I switched over to using a single getAllSubWebCollection call and
>>>>> caching its results. That call returns the full list of all subsites as
>>>>> pairs of Title and Url.  I had a POJO similar to the one below which held
>>>>> the list of sites and contained logic for enumerating the child sites,
>>>>> given the URL of a (parent) site.  From what I recall, getWebCollection
>>>>> works inconsistently, either across SP versions or across installations,
>>>>> but the logic below should work in any case.
>>>>>
>>>>> *** public class SubSiteCollection -- holds a list of CrawledSite
>>>>> POJOs, each of which is a { title, url }.
>>>>>
>>>>> *** SubSiteCollection has the following:
>>>>>
>>>>>  // Assumes a List<CrawledSite> field named 'sites' (populated once from
>>>>>  // getAllSubWebCollection) and a String constant SLASH = "/".
>>>>>  // StringUtils.countOccurrencesOf is e.g. Spring's
>>>>>  // org.springframework.util.StringUtils.
>>>>>  public List<CrawledSite> getImmediateSubSites(String siteUrl) {
>>>>>    List<CrawledSite> subSites = new ArrayList<CrawledSite>();
>>>>>    for (CrawledSite site : sites) {
>>>>>      if (isChildOf(siteUrl, site.getUrl().toString())) {
>>>>>        subSites.add(site);
>>>>>      }
>>>>>    }
>>>>>    return subSites;
>>>>>  }
>>>>>
>>>>>  // A URL is an immediate child if it starts with the parent URL and has
>>>>>  // exactly one path segment beyond it.
>>>>>  private static boolean isChildOf(String parentUrl, String urlToCheck) {
>>>>>    final String parent = normalizeUrl(parentUrl);
>>>>>    final String child = normalizeUrl(urlToCheck);
>>>>>    boolean ret = false;
>>>>>    if (child.startsWith(parent)) {
>>>>>      String remainder = child.substring(parent.length());
>>>>>      ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
>>>>>    }
>>>>>    return ret;
>>>>>  }
>>>>>
>>>>>  // Lower-case and append a trailing slash so comparisons are consistent.
>>>>>  private static String normalizeUrl(String url) {
>>>>>    return ((url.endsWith(SLASH)) ? url : url + SLASH).toLowerCase();
>>>>>  }
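>>>>>
>>>>> A rough usage sketch, just to show the intent - the server and subsite
>>>>> URLs here are made up, and 'collection' stands for a SubSiteCollection
>>>>> that has already been filled from the getAllSubWebCollection results:
>>>>>
>>>>>  // Suppose getAllSubWebCollection returned these Urls (among others):
>>>>>  //   http://server/Abcd
>>>>>  //   http://server/Abcd/Sub1
>>>>>  //   http://server/Abcd/Sub1/Sub2
>>>>>  // Then asking for the immediate children of /Abcd yields only Sub1 -
>>>>>  // never Sub1/Sub2 and never /Abcd itself - so the walk can't loop.
>>>>>  List<CrawledSite> children =
>>>>>      collection.getImmediateSubSites("http://server/Abcd");
>>>>>  for (CrawledSite child : children) {
>>>>>    System.out.println(child.getUrl());  // http://server/Abcd/Sub1
>>>>>  }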
>>>>>
>>>>> - Dmitry
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> Have a look at this sequence also:
>>>>>>
>>>>>> >>>>>>
>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>>> Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd',
>>>>>> 'Abcd'
>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>>> Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij',
>>>>>> 'Defghij'
>>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>>> Subsite list: '
>>>>>> http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr', 'Klmnopqr'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched rule path '/*'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly matched rule path '/*'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
>>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly matched rule path '/*'
>>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>>
>>>>>> <<<<<<
>>>>>>
>>>>>> This is using the GetSites(String parent) method with a site name of
>>>>>> "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back three sites (!!).  The
>>>>>> parent path is not correct, obviously, but nevertheless this is one way in
>>>>>> which paths are getting completely messed up.  It *looks* like the Webs web
>>>>>> service is broken in such a way as to ignore the URL coming in, except for
>>>>>> the base part, which means that subsite discovery is effectively disabled
>>>>>> except directly under the root site.
>>>>>>
>>>>>> This might still be OK if it is not possible to create subsites of
>>>>>> subsites in this version of SharePoint.  Can you confirm that this is or is
>>>>>> not possible?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 16, 2013 at 2:42 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>
>>>>>>> "This is everything that got generated, from the very beginning"
>>>>>>>
>>>>>>> Well, something isn't right.  What I expect to see that I don't
>>>>>>> right up front are:
>>>>>>>
>>>>>>> - A webs "getWebCollection" invocation for /_vti_bin/webs.asmx
>>>>>>> - Two lists "getListCollection" invocations for /_vti_bin/lists.asmx
>>>>>>>
>>>>>>> Instead the first transactions I see are from already busted URLs -
>>>>>>> which make no sense since there would be no way they should have been able
>>>>>>> to get queued yet.
>>>>>>>
>>>>>>> So there are a number of possibilities.  First, maybe the log isn't
>>>>>>> getting cleared out, and the session in question therefore starts somewhere
>>>>>>> in the middle of manifoldcf.log.1.  But no:
>>>>>>>
>>>>>>> >>>>>>
>>>>>>> C:\logs>grep "POST /_vti_bin/webs" manifoldcf.log.1
>>>>>>> grep: input lines truncated - result questionable
>>>>>>> <<<<<<
>>>>>>>
>>>>>>> Nevertheless there are some interesting points here.  First, note
>>>>>>> the following response, which I've been able to determine is against "Test
>>>>>>> Library 1":
>>>>>>>
>>>>>>> >>>>>>
>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>>>>>>> getListItems xml response: '<GetListItems xmlns="
>>>>>>> http://schemas.microsoft.com/sharepoint/soap/directory/"><GetListItemsResponse
>>>>>>> xmlns=""><GetListItemsResult
>>>>>>> FileRef="SitePages/Home.aspx"/></GetListItemsResponse></GetListItems>'
>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>>>>>>> Checking whether to include document '/SitePages/Home.aspx'
>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>>>>>>> File '/SitePages/Home.aspx' exactly matched rule path '/*'
>>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>>>>>>> Including file '/SitePages/Home.aspx'
>>>>>>>  WARN 2013-09-16 13:02:31,590 (Worker thread '23') - Sharepoint:
>>>>>>> Unexpected relPath structure; path is '/SitePages/Home.aspx', but expected
>>>>>>> <list/library> length of 26
>>>>>>> <<<<<<
>>>>>>>
>>>>>>> The FileRef in this case is pointing at what, exactly?  Is there a
>>>>>>> SitePages/Home.aspx in the "Test Library 1" library?  Or does it mean to
>>>>>>> refer back to the root site with this URL construction?  And since this is
>>>>>>> supposedly at the root level, how come the combined site + library name
>>>>>>> comes out to 26??  I get 15, which leaves 11 characters unaccounted for.
>>>>>>>
>>>>>>> I'm still looking at the logs to see if I can glean key
>>>>>>> information.  Later, if I could set up a crawl against the SharePoint
>>>>>>> instance in question, that would certainly help.  I can readily set up an
>>>>>>> ssh tunnel if that is what is required.  But I won't be able to do it until
>>>>>>> I get home tonight.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 16, 2013 at 1:58 PM, Dmitry Goldenberg <
>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>
>>>>>>>> Karl,
>>>>>>>>
>>>>>>>> This is everything that got generated, from the very beginning,
>>>>>>>> meaning that I did a fresh build, new database, new connection definitions,
>>>>>>>> start. The log must have rolled but the .1 log is included.
>>>>>>>>
>>>>>>>> If I were to get you access to the actual test system, would you
>>>>>>>> mind taking a look? It may be more efficient than sending logs..
>>>>>>>>
>>>>>>>> - Dmitry
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> These logs are different but have exactly the same problem; they
>>>>>>>>> start in the middle when the crawl is already well underway.  I'm wondering
>>>>>>>>> if by chance you have more than one agents process running or something?
>>>>>>>>> Or maybe the log is rolling and stuff is getting lost?  What's there is not
>>>>>>>>> what I would expect to see, at all.
>>>>>>>>>
>>>>>>>>> I *did* manage to find two transactions that look like they might
>>>>>>>>> be helpful, but because the *results* of those transactions are required by
>>>>>>>>> transactions that take place minutes *before* in the log, I have no
>>>>>>>>> confidence that I'm looking at anything meaningful.  But I'll get back to
>>>>>>>>> you on what I find nonetheless.
>>>>>>>>>
>>>>>>>>> If you decide to repeat this exercise, try watching the log with
>>>>>>>>> "tail -f" before starting the job.  You should not see any log contents at
>>>>>>>>> all until the job is started.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <
>>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>
>>>>>>>>>> Karl,
>>>>>>>>>>
>>>>>>>>>> Attached please find logs which start at the beginning. I started
>>>>>>>>>> from a fresh build (clean db etc.), the logs start at server start, then I
>>>>>>>>>> create the output connection and the repo connection, then the job, and
>>>>>>>>>> then I fire off the job. I aborted the execution about a minute into it or
>>>>>>>>>> so.  That's all that's in the logs with:
>>>>>>>>>>
>>>>>>>>>> org.apache.manifoldcf.connectors=DEBUG
>>>>>>>>>>
>>>>>>>>>> log4j.logger.httpclient.wire.header=DEBUG
>>>>>>>>>> log4j.logger.org.apache.commons.httpclient=DEBUG
>>>>>>>>>>
>>>>>>>>>> - Dmitry
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>
>>>>>>>>>>> Are you sure these are the right logs?
>>>>>>>>>>> - They start right in the middle of a crawl
>>>>>>>>>>> - They are already in a broken state when they start, e.g. the
>>>>>>>>>>> kinds of things that are being looked up are already nonsense paths
>>>>>>>>>>>
>>>>>>>>>>> I need to see logs from the BEGINNING of a fresh crawl to see
>>>>>>>>>>> how the nonsense paths happen.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Karl,
>>>>>>>>>>>>
>>>>>>>>>>>> I've generated logs with details as we discussed.
>>>>>>>>>>>>
>>>>>>>>>>>> The job was created afresh, as before:
>>>>>>>>>>>> Path rules:
>>>>>>>>>>>> /* file include
>>>>>>>>>>>> /* library include
>>>>>>>>>>>> /* list include
>>>>>>>>>>>> /* site include
>>>>>>>>>>>> Metadata:
>>>>>>>>>>>> /* include true
>>>>>>>>>>>> The logs are attached.
>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> "Do you think that this issue is generic
with regard to any
>>>>>>>>>>>>> Amz instance?"
>>>>>>>>>>>>>
>>>>>>>>>>>>> I presume so, since you didn't apparently do anything special
>>>>>>>>>>>>> to set one of these up.  Unfortunately, such instances are not part of the
>>>>>>>>>>>>> free tier, so I am still constrained from setting one up for myself because
>>>>>>>>>>>>> of household rules here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> "For now, I assume our only workaround
is to list the paths of
>>>>>>>>>>>>> interest manually"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Depending on what is going wrong, that may not even work.  It
>>>>>>>>>>>>> looks like several SharePoint web service calls may be affected, and not in
>>>>>>>>>>>>> a cleanly predictable way, for this to happen.
>>>>>>>>>>>>>
>>>>>>>>>>>>> "is identification and extraction of
attachments supported in
>>>>>>>>>>>>> the SP connector?"
>>>>>>>>>>>>>
>>>>>>>>>>>>> ManifoldCF in general leaves identification and extraction to
>>>>>>>>>>>>> the search engine.  Solr, for instance, uses Tika for this, if so
>>>>>>>>>>>>> configured.  You can configure your Solr output connection to include or
>>>>>>>>>>>>> exclude specific mime types or extensions if you want to limit what is
>>>>>>>>>>>>> attempted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Karl. Do you think that this issue is generic with
>>>>>>>>>>>>>> regard to any Amz instance? I'm just wondering how easily reproducible this
>>>>>>>>>>>>>> may be..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For now, I assume our only workaround is to list the paths of
>>>>>>>>>>>>>> interest manually, i.e. add explicit rules for each library and list.
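>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (For illustration only - I haven't double-checked the exact rule
>>>>>>>>>>>>>> syntax beyond the wildcards we already use, but presumably the
>>>>>>>>>>>>>> explicit rules would look something like this, one set per
>>>>>>>>>>>>>> library or list on each site:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /Abcd/Shared Documents library include
>>>>>>>>>>>>>> /Abcd/Shared Documents/* file include
>>>>>>>>>>>>>> /Abcd/Announcements list include
>>>>>>>>>>>>>> /Defghij/Shared Documents library include
>>>>>>>>>>>>>> /Defghij/Shared Documents/* file include
>>>>>>>>>>>>>> and so on for Klmnopqr and the rest.)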
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A related subject - is identification and extraction of
>>>>>>>>>>>>>> attachments supported in the SP connector?  E.g. if I have a Word doc
>>>>>>>>>>>>>> attached to a Task list item, would that be extracted?  So far, I see that
>>>>>>>>>>>>>> library content gets crawled and I'm getting the list item data but am not
>>>>>>>>>>>>>> sure what happens to the attachments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the additional information.  It does appear like
>>>>>>>>>>>>>>> the method that lists subsites is not working as expected under AWS.  Nor
>>>>>>>>>>>>>>> are some number of other methods which supposedly just list the children of
>>>>>>>>>>>>>>> a subsite.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've reopened CONNECTORS-772 to work on addressing this
>>>>>>>>>>>>>>> issue.  Please stay tuned.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Most of the paths that get generated are listed in the
>>>>>>>>>>>>>>>> attached log; they match what shows up in the diag report. So I'm not sure
>>>>>>>>>>>>>>>> where they diverge; most of them just don't seem right.  There are 3
>>>>>>>>>>>>>>>> subsites rooted in the main site: Abcd, Defghij, Klmnopqr.  It's strange
>>>>>>>>>>>>>>>> that the connector would try such paths as:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there
>>>>>>>>>>>>>>>> are multiple repetitions of the same subsite on the path and, to begin with,
>>>>>>>>>>>>>>>> Defghij is not a subsite of Klmnopqr, so why would it try this? The /// at
>>>>>>>>>>>>>>>> the end doesn't seem correct either, unless I'm missing something in how
>>>>>>>>>>>>>>>> this pathing works.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks
>>>>>>>>>>>>>>>> wrong. A docname is mixed into the path, a subsite ends up after a docname?...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types
>>>>>>>>>>>>>>>> of issues plus now somehow the docname got split with a forward slash?..
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are also a bunch of StringIndexOutOfBoundsExceptions.  Perhaps this
>>>>>>>>>>>>>>>> logic doesn't fit with the pathing we're seeing on this amz-based installation?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd expect the logic to just know that root contains 3
>>>>>>>>>>>>>>>> subsites, and work off that. Each subsite has a specific list of libraries
>>>>>>>>>>>>>>>> and lists, etc. It seems odd that the connector gets into this matching
>>>>>>>>>>>>>>>> pattern, and tries what looks like thousands of variations (I aborted the
>>>>>>>>>>>>>>>> execution).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To clarify, the way you would need to analyze this is to
>>>>>>>>>>>>>>>>> run a crawl with the wildcards as you have selected, abort if necessary
>>>>>>>>>>>>>>>>> after a while, and then use the Document Status report to list the document
>>>>>>>>>>>>>>>>> identifiers that had been generated.  Find a document identifier that you
>>>>>>>>>>>>>>>>> believe represents a path that is illegal, and figure out what SOAP
>>>>>>>>>>>>>>>>> getChild call caused the problem by returning incorrect data.  In other
>>>>>>>>>>>>>>>>> words, find the point in the path where the path diverges from what exists
>>>>>>>>>>>>>>>>> into what doesn't exist, and go back in the ManifoldCF logs to find the
>>>>>>>>>>>>>>>>> particular SOAP request that led to the issue.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd expect from your description that the problem lies
>>>>>>>>>>>>>>>>> with getting child sites given a site path, but that's just a guess at this
>>>>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't understand what you mean by "I've tried the set
>>>>>>>>>>>>>>>>>> of wildcards as below and I seem to be running into a lot of cycles, where
>>>>>>>>>>>>>>>>>> various subsite folders are appended to each other and an extraction of
>>>>>>>>>>>>>>>>>> data at all of those locations is attempted".   If you are seeing cycles it
>>>>>>>>>>>>>>>>>> means that document discovery is still failing in some way.  For each
>>>>>>>>>>>>>>>>>> folder/library/site/subsite, only the children of that
>>>>>>>>>>>>>>>>>> folder/library/site/subsite should be appended to the path - ever.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you can give a specific example, preferably including
>>>>>>>>>>>>>>>>>> the SOAP back-and-forth, that would be very helpful.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Quick question. Is there an easy way to configure an SP
>>>>>>>>>>>>>>>>>>> repo connection for crawling of all content, from the root site all the way
>>>>>>>>>>>>>>>>>>> down?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I've tried the set of wildcards as below and I seem to
>>>>>>>>>>>>>>>>>>> be running into a lot of cycles, where various subsite folders are appended
>>>>>>>>>>>>>>>>>>> to each other and an extraction of data at all of those locations is
>>>>>>>>>>>>>>>>>>> attempted. Ideally I'd like to avoid having to construct an exact set of
>>>>>>>>>>>>>>>>>>> paths because the set may change, especially with new content being added.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Path rules:
>>>>>>>>>>>>>>>>>>> /* file include
>>>>>>>>>>>>>>>>>>> /* library include
>>>>>>>>>>>>>>>>>>> /* list include
>>>>>>>>>>>>>>>>>>> /* site include
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Metadata:
>>>>>>>>>>>>>>>>>>> /* include true
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'd also like to pull down any files attached to list
>>>>>>>>>>>>>>>>>>> items. I'm hoping that some type of "/* file include" should do it, once I
>>>>>>>>>>>>>>>>>>> figure out how to safely include all content.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
