manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: Getting a 401 Unauthorized on a SharePoint 2010 crawl request, with MCPermissions.asmx installed
Date: Tue, 17 Sep 2013 11:38:16 GMT
Hi Dmitry,

Another interesting bit from the log:

>>>>>>
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/_catalogs/lt/Forms/AllItems.aspx', 'List Template Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/_catalogs/masterpage/Forms/AllItems.aspx', 'Master Page Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/Shared Documents/Forms/AllItems.aspx', 'Shared Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/SiteAssets/Forms/AllItems.aspx', 'Site Assets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/SitePages/Forms/AllPages.aspx', 'Site Pages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/_catalogs/solutions/Forms/AllItems.aspx', 'Solution Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/Style Library/Forms/AllItems.aspx', 'Style Library'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/Test Library 1/Forms/AllItems.aspx', 'Test Library 1'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/_catalogs/theme/Forms/AllItems.aspx', 'Theme Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
list: '/_catalogs/wp/Forms/AllItems.aspx', 'Web Part Gallery'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking
whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared
Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
'/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents' exactly matched
rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including
library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Shared Documents'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking
whether to include library
'/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
'/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets' exactly matched rule
path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including
library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SiteAssets'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking
whether to include library
'/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
'/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages' exactly matched rule
path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including
library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/SitePages'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Checking
whether to include library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style
Library'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Library
'/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library' exactly matched
rule path '/*'
DEBUG 2013-09-16 11:43:56,799 (Worker thread '7') - SharePoint: Including
library '/Abcd/Klmnopqr/Klmnopqr/Defghij/Defghij/Style Library'
<<<<<<

This time it appears that it is the Lists service that is broken and does
not recognize the parent site.

I haven't corrected this problem yet, because I am now beginning to wonder
whether *any* of the web services under Amazon work at all for subsites.  We
may be better off implementing everything we need in the MCPermissions
service.  I will ponder this as I continue to go through the logs.

It would still be valuable to check my getSites() implementation.  I'll be
doing another round of work on the plugin tonight.


Karl


On Mon, Sep 16, 2013 at 8:45 PM, Karl Wright <daddywri@gmail.com> wrote:

> The augmented plugin can be downloaded from
> http://people.apache.org/~kwright/MetaCarta.SharePoint.MCPermissionsService.wsp.  The
> revised connector code is also ready, and should be checked out and built from
> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-772 .
>
> Once you set it all up, you can see if it is doing the right thing by just
> trying to drill down through subsites in the UI.  You should always see a
> list of subsites that is appropriate for the context you are in; if this
> does not happen it is not working.
>
> Thanks,
> Karl
>
>
>
> On Mon, Sep 16, 2013 at 7:45 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>
>> Karl,
>>
>> I can see how preloading the list of subsites may be less optimal. The
>> advantage of doing it this way is that one call gets you the whole structure
>> in memory, which may be OK unless there are sites with so many subsites that
>> memory becomes a concern. The disadvantage is having to pass this structure
>> around.
>>
>> Yes, I'll certainly help test out your changes, just let me know when
>> they're available.
>>
>> Thanks,
>> - Dmitry
>>
>>
>> On Mon, Sep 16, 2013 at 7:19 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Thanks for the code snippet.  I'd prefer, though, to not preload the
>>> entire site structure in memory.  Probably it would be better to just add
>>> another method to the ManifoldCF SharePoint 2010 plugin.  More methods are
>>> going to be added anyway to support Claim Space Authentication, so I guess
>>> this would be just one more.
>>>
>>> We honestly have never seen this problem before - so it's not just
>>> flakiness, it has something to do with the installation, I'm certain.  At
>>> any rate, I'll get going right away on a workaround - if you are willing to
>>> test what I produce.  I'm also certain there is at least one other issue,
>>> but hopefully that will become clearer once this one is resolved.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Sep 16, 2013 at 6:49 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>
>>>> Karl,
>>>>
>>>> >> subsite discovery is effectively disabled except directly under the
>>>> root site
>>>>
>>>> Yes. Come to think of it, I once came across this problem while
>>>> implementing a SharePoint connector.  I'm not sure whether it's exactly
>>>> what's happening with the issue we're discussing, but it looks like it.
>>>>
>>>> I started off by using multiple getWebCollection calls to get child
>>>> subsites of sites and trying to navigate down that way. The problem was
>>>> that getWebCollection was always returning the immediate subsites of the
>>>> root site no matter whether you're at the root or below, so I ended up
>>>> generating infinite loops.
>>>>
>>>> I switched over to using a single getAllSubWebCollection call and
>>>> caching its results. That call returns the full list of all subsites as
>>>> pairs of Title and Url.  I had a POJO similar to the one below which held
>>>> the list of sites and contained logic for enumerating the child sites,
>>>> given the URL of a (parent) site.  From what I recall, getWebCollection
>>>> works inconsistently, either across SP versions or across installations,
>>>> but the logic below should work in any case.
>>>>
>>>> *** public class SubSiteCollection -- holds a list of CrawledSite
>>>> POJOs, each of which is a { title, url }.
>>>>
>>>> *** SubSiteCollection has the following:
>>>>
>>>>  public List<CrawledSite> getImmediateSubSites(String siteUrl) {
>>>>   List<CrawledSite> subSites = new ArrayList<CrawledSite>();
>>>>   for (CrawledSite site : sites) {
>>>>    if (isChildOf(siteUrl, site.getUrl().toString())) {
>>>>     subSites.add(site);
>>>>    }
>>>>   }
>>>>   return subSites;
>>>>  }
>>>>
>>>>  private static boolean isChildOf(String parentUrl, String urlToCheck) {
>>>>   final String parent = normalizeUrl(parentUrl);
>>>>   final String child = normalizeUrl(urlToCheck);
>>>>   boolean ret = false;
>>>>   if (child.startsWith(parent)) {
>>>>    String remainder = child.substring(parent.length());
>>>>    ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
>>>>   }
>>>>   return ret;
>>>>  }
>>>>
>>>>  private static String normalizeUrl(String url) {
>>>>   return ((url.endsWith(SLASH)) ? url : url + SLASH).toLowerCase();
>>>>  }
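>>>>
>>>> For completeness, here is a self-contained sketch of how the surrounding
>>>> pieces could look (an illustration reconstructed from the description
>>>> above, not the exact production code: the CrawledSite holder, the sites
>>>> field, and the SLASH constant are assumptions, url is simplified to a
>>>> String, and Spring's StringUtils.countOccurrencesOf is replaced with a
>>>> small local helper):
>>>>
>>>> import java.util.ArrayList;
>>>> import java.util.List;
>>>>
>>>> public class SubSiteCollection {
>>>>   private static final String SLASH = "/";
>>>>
>>>>   // Every site in the site collection, cached from one getAllSubWebCollection call.
>>>>   private final List<CrawledSite> sites;
>>>>
>>>>   public SubSiteCollection(List<CrawledSite> sites) {
>>>>     this.sites = sites;
>>>>   }
>>>>
>>>>   // Return only the sites exactly one level below the given parent URL.
>>>>   public List<CrawledSite> getImmediateSubSites(String siteUrl) {
>>>>     List<CrawledSite> subSites = new ArrayList<CrawledSite>();
>>>>     for (CrawledSite site : sites) {
>>>>       if (isChildOf(siteUrl, site.getUrl())) {
>>>>         subSites.add(site);
>>>>       }
>>>>     }
>>>>     return subSites;
>>>>   }
>>>>
>>>>   private static boolean isChildOf(String parentUrl, String urlToCheck) {
>>>>     final String parent = normalizeUrl(parentUrl);
>>>>     final String child = normalizeUrl(urlToCheck);
>>>>     if (!child.startsWith(parent)) {
>>>>       return false;
>>>>     }
>>>>     // An immediate child has exactly one path segment (hence one trailing slash) left over.
>>>>     String remainder = child.substring(parent.length());
>>>>     return countOccurrences(remainder, SLASH) == 1;
>>>>   }
>>>>
>>>>   private static String normalizeUrl(String url) {
>>>>     return (url.endsWith(SLASH) ? url : url + SLASH).toLowerCase();
>>>>   }
>>>>
>>>>   private static int countOccurrences(String haystack, String needle) {
>>>>     int count = 0;
>>>>     for (int i = haystack.indexOf(needle); i >= 0; i = haystack.indexOf(needle, i + needle.length())) {
>>>>       count++;
>>>>     }
>>>>     return count;
>>>>   }
>>>>
>>>>   // Simple { title, url } holder.
>>>>   public static class CrawledSite {
>>>>     private final String title;
>>>>     private final String url;
>>>>
>>>>     public CrawledSite(String title, String url) {
>>>>       this.title = title;
>>>>       this.url = url;
>>>>     }
>>>>
>>>>     public String getTitle() { return title; }
>>>>     public String getUrl() { return url; }
>>>>   }
>>>> }
>>>>
>>>> So, for example, with a root site (say, http://server) that has subsites
>>>> /Abcd, /Defghij and /Klmnopqr, getImmediateSubSites("http://server/Abcd")
>>>> returns only the sites sitting exactly one path segment below
>>>> http://server/Abcd/; anything nested deeper leaves more than one slash in
>>>> the remainder and is skipped, which is what keeps the traversal from
>>>> looping.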
>>>>
>>>> - Dmitry
>>>>
>>>>
>>>>
>>>> On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Dmitry,
>>>>>
>>>>> Have a look at this sequence also:
>>>>>
>>>>> >>>>>>
>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>> Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd',
>>>>> 'Abcd'
>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>> Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij',
>>>>> 'Defghij'
>>>>> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint:
>>>>> Subsite list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr',
>>>>> 'Klmnopqr'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched rule path '/*'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly matched rule path '/*'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>> Checking whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
>>>>> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly matched rule path '/*'
>>>>> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint:
>>>>> Including site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>>>>>
>>>>> <<<<<<
>>>>>
>>>>> This is using the GetSites(String parent) method with a site name of
>>>>> "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back three sites (!!).  The
>>>>> parent path is not correct, obviously, but nevertheless this is one way in
>>>>> which paths are getting completely messed up.  It *looks* like the Webs web
>>>>> service is broken in such a way as to ignore the URL coming in, except for
>>>>> the base part, which means that subsite discovery is effectively disabled
>>>>> except directly under the root site.
>>>>>
>>>>> This might still be OK if it is not possible to create subsites of
>>>>> subsites in this version of SharePoint.  Can you confirm that this is or is
>>>>> not possible?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Sep 16, 2013 at 2:42 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> "This is everything that got generated, from the very beginning"
>>>>>>
>>>>>> Well, something isn't right.  What I expect to see right up front, but
>>>>>> don't, are:
>>>>>>
>>>>>> - A webs "getWebCollection" invocation for /_vti_bin/webs.asmx
>>>>>> - Two lists "getListCollection" invocations for /_vti_bin/lists.asmx
>>>>>>
>>>>>> Instead, the first transactions I see are from already-busted URLs -
>>>>>> which make no sense, since there is no way they should have been able
>>>>>> to get queued yet.
>>>>>>
>>>>>> So there are a number of possibilities.  First, maybe the log isn't
>>>>>> getting cleared out, and the session in question therefore starts somewhere
>>>>>> in the middle of manifoldcf.log.1.  But no:
>>>>>>
>>>>>> >>>>>>
>>>>>> C:\logs>grep "POST /_vti_bin/webs" manifoldcf.log.1
>>>>>> grep: input lines truncated - result questionable
>>>>>> <<<<<<
>>>>>>
>>>>>> Nevertheless there are some interesting points here.  First, note the
>>>>>> following response, which I've been able to determine is against "Test
>>>>>> Library 1":
>>>>>>
>>>>>> >>>>>>
>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>>>>>> getListItems xml response: '<GetListItems xmlns="
>>>>>> http://schemas.microsoft.com/sharepoint/soap/directory/"><GetListItemsResponse
>>>>>> xmlns=""><GetListItemsResult
>>>>>> FileRef="SitePages/Home.aspx"/></GetListItemsResponse></GetListItems>'
>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>>>>>> Checking whether to include document '/SitePages/Home.aspx'
>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: File
>>>>>> '/SitePages/Home.aspx' exactly matched rule path '/*'
>>>>>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>>>>>> Including file '/SitePages/Home.aspx'
>>>>>>  WARN 2013-09-16 13:02:31,590 (Worker thread '23') - Sharepoint:
>>>>>> Unexpected relPath structure; path is '/SitePages/Home.aspx', but expected
>>>>>> <list/library> length of 26
>>>>>> <<<<<<
>>>>>>
>>>>>> The FileRef in this case is pointing at what, exactly?  Is there a
>>>>>> SitePages/Home.aspx in the "Test Library 1" library?  Or does it mean to
>>>>>> refer back to the root site with this URL construction?  And since this is
>>>>>> supposedly at the root level, how come the combined site + library name
>>>>>> comes out to 26??  I get 15, which leaves 11 characters unaccounted for.
>>>>>>
>>>>>> I'm still looking at the logs to see if I can glean key information.
>>>>>> Later, if I could set up a crawl against the SharePoint instance in
>>>>>> question, that would certainly help.  I can readily set up an ssh tunnel if
>>>>>> that is what is required.  But I won't be able to do it until I get home
>>>>>> tonight.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 16, 2013 at 1:58 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>
>>>>>>> Karl,
>>>>>>>
>>>>>>> This is everything that got generated, from the very beginning,
>>>>>>> meaning that I did a fresh build, new database, new connection
>>>>>>> definitions, and then started.  The log must have rolled, but the .1 log
>>>>>>> is included.
>>>>>>>
>>>>>>> If I were to get you access to the actual test system, would you
>>>>>>> mind taking a look? It may be more efficient than sending logs..
>>>>>>>
>>>>>>> - Dmitry
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>
>>>>>>>> These logs are different but have exactly the same problem; they
>>>>>>>> start in the middle when the crawl is already well underway.  I'm wondering
>>>>>>>> if by chance you have more than one agents process running or something?
>>>>>>>> Or maybe the log is rolling and stuff is getting lost?  What's there is not
>>>>>>>> what I would expect to see, at all.
>>>>>>>>
>>>>>>>> I *did* manage to find two transactions that look like they might
>>>>>>>> be helpful, but because the *results* of those transactions are required by
>>>>>>>> transactions that take place minutes *before* in the log, I have no
>>>>>>>> confidence that I'm looking at anything meaningful.  But I'll get back to
>>>>>>>> you on what I find nonetheless.
>>>>>>>>
>>>>>>>> If you decide to repeat this exercise, try watching the log with "tail
>>>>>>>> -f" before starting the job.  You should not see any log contents at all
>>>>>>>> until the job is started.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>
>>>>>>>>> Karl,
>>>>>>>>>
>>>>>>>>> Attached please find logs which start at the beginning.  I started
>>>>>>>>> from a fresh build (clean db, etc.); the logs start at server start, then I
>>>>>>>>> create the output connection and the repo connection, then the job, and
>>>>>>>>> then I fire off the job.  I aborted the execution about a minute into it or
>>>>>>>>> so.  That's all that's in the logs, with:
>>>>>>>>>
>>>>>>>>> org.apache.manifoldcf.connectors=DEBUG
>>>>>>>>>
>>>>>>>>> log4j.logger.httpclient.wire.header=DEBUG
>>>>>>>>> log4j.logger.org.apache.commons.httpclient=DEBUG
>>>>>>>>>
>>>>>>>>> - Dmitry
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>
>>>>>>>>>> Are you sure these are the right logs?
>>>>>>>>>> - They start right in the middle of a crawl
>>>>>>>>>> - They are already in a broken state when they start, e.g. the
>>>>>>>>>> kinds of things that are being looked up are already nonsense paths
>>>>>>>>>>
>>>>>>>>>> I need to see logs from the BEGINNING of a fresh crawl to see how
>>>>>>>>>> the nonsense paths happen.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Karl,
>>>>>>>>>>>
>>>>>>>>>>> I've generated logs with details as we discussed.
>>>>>>>>>>>
>>>>>>>>>>> The job was created afresh, as before:
>>>>>>>>>>> Path rules:
>>>>>>>>>>> /* file include
>>>>>>>>>>> /* library include
>>>>>>>>>>> /* list include
>>>>>>>>>>> /* site include
>>>>>>>>>>> Metadata:
>>>>>>>>>>> /* include true
>>>>>>>>>>> The logs are attached.
>>>>>>>>>>> - Dmitry
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> "Do you think that this issue is generic
with regard to any Amz
>>>>>>>>>>>> instance?"
>>>>>>>>>>>>
>>>>>>>>>>>> I presume so, since you didn't apparently do anything special
>>>>>>>>>>>> to set one of these up.  Unfortunately, such instances are not part of the
>>>>>>>>>>>> free tier, so I am still constrained from setting one up for myself because
>>>>>>>>>>>> of household rules here.
>>>>>>>>>>>>
>>>>>>>>>>>> "For now, I assume our only workaround is
to list the paths of
>>>>>>>>>>>> interest manually"
>>>>>>>>>>>>
>>>>>>>>>>>> Depending on what is going wrong, that may not even work.  For
>>>>>>>>>>>> this to happen, it looks like several SharePoint web service calls may be
>>>>>>>>>>>> affected, and not in a cleanly predictable way.
>>>>>>>>>>>>
>>>>>>>>>>>> "is identification and extraction of attachments
supported in
>>>>>>>>>>>> the SP connector?"
>>>>>>>>>>>>
>>>>>>>>>>>> ManifoldCF in general leaves identification and extraction to
>>>>>>>>>>>> the search engine.  Solr, for instance, uses Tika for this, if so
>>>>>>>>>>>> configured.  You can configure your Solr output connection to include or
>>>>>>>>>>>> exclude specific mime types or extensions if you want to limit what is
>>>>>>>>>>>> attempted.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Karl. Do you think that this issue is generic with
>>>>>>>>>>>>> regard to any Amz instance? I'm just wondering how easily reproducible this
>>>>>>>>>>>>> may be..
>>>>>>>>>>>>>
>>>>>>>>>>>>> For now, I assume our only workaround is to list the paths of
>>>>>>>>>>>>> interest manually, i.e. add explicit rules for each library and list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A related subject - is identification and extraction of
>>>>>>>>>>>>> attachments supported in the SP connector?  E.g. if I have a Word doc
>>>>>>>>>>>>> attached to a Task list item, would that be extracted?  So far, I see that
>>>>>>>>>>>>> library content gets crawled and I'm getting the list item data but am not
>>>>>>>>>>>>> sure what happens to the attachments.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the additional information.  It does appear like
>>>>>>>>>>>>>> the method that lists subsites is not working as expected under AWS.  Nor
>>>>>>>>>>>>>> are some number of other methods which supposedly just list the children of
>>>>>>>>>>>>>> a subsite.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've reopened CONNECTORS-772 to work on addressing this
>>>>>>>>>>>>>> issue.  Please stay tuned.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Most of the paths that get generated are listed in the
>>>>>>>>>>>>>>> attached log; they match what shows up in the diag report. So I'm not sure
>>>>>>>>>>>>>>> where they diverge; most of them just don't seem right.  There are 3
>>>>>>>>>>>>>>> subsites rooted in the main site: Abcd, Defghij, Klmnopqr.  It's strange
>>>>>>>>>>>>>>> that the connector would try such paths as:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there
>>>>>>>>>>>>>>> are multiple repetitions of the same subsite on the path and, to begin with,
>>>>>>>>>>>>>>> Defghij is not a subsite of Klmnopqr, so why would it try this?  The /// at
>>>>>>>>>>>>>>> the end doesn't seem correct either, unless I'm missing something in how
>>>>>>>>>>>>>>> this pathing works.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /Test Library
>>>>>>>>>>>>>>> 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks wrong. A
>>>>>>>>>>>>>>> docname is mixed into the path, a subsite ends up after a docname?...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /Shared
>>>>>>>>>>>>>>> Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types of
>>>>>>>>>>>>>>> issues plus now somehow the docname got split with a forward slash?..
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are also a bunch of
>>>>>>>>>>>>>>> StringIndexOutOfBoundsExceptions.  Perhaps this logic doesn't fit with the
>>>>>>>>>>>>>>> pathing we're seeing on this amz-based installation?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd expect the logic to just know that root contains 3
>>>>>>>>>>>>>>> subsites, and work off that.  Each subsite has a specific list of libraries
>>>>>>>>>>>>>>> and lists, etc.  It seems odd that the connector gets into this matching
>>>>>>>>>>>>>>> pattern, and tries what looks like thousands of variations (I aborted the
>>>>>>>>>>>>>>> execution).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To clarify, the way you would need to analyze this is to
>>>>>>>>>>>>>>>> run a crawl with the wildcards as you have selected, abort if necessary
>>>>>>>>>>>>>>>> after a while, and then use the Document Status report to list the document
>>>>>>>>>>>>>>>> identifiers that had been generated.  Find a document identifier that you
>>>>>>>>>>>>>>>> believe represents a path that is illegal, and figure out what SOAP
>>>>>>>>>>>>>>>> getChild call caused the problem by returning incorrect data.  In other
>>>>>>>>>>>>>>>> words, find the point in the path where the path diverges from what exists
>>>>>>>>>>>>>>>> into what doesn't exist, and go back in the ManifoldCF logs to find the
>>>>>>>>>>>>>>>> particular SOAP request that led to the issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd expect from your description that the problem lies with
>>>>>>>>>>>>>>>> getting child sites given a site path, but that's just a guess at this
>>>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't understand what you mean by "I've tried the set of
>>>>>>>>>>>>>>>>> wildcards as below and I seem to be running into a lot of cycles, where
>>>>>>>>>>>>>>>>> various subsite folders are appended to each other and an extraction of
>>>>>>>>>>>>>>>>> data at all of those locations is attempted".  If you are seeing cycles it
>>>>>>>>>>>>>>>>> means that document discovery is still failing in some way.  For each
>>>>>>>>>>>>>>>>> folder/library/site/subsite, only the children of that
>>>>>>>>>>>>>>>>> folder/library/site/subsite should be appended to the path - ever.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If you can give a specific example, preferably including
>>>>>>>>>>>>>>>>> the SOAP back-and-forth, that would be very helpful.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Quick question. Is there an easy way to configure an SP
>>>>>>>>>>>>>>>>>> repo connection for crawling of all content, from the root site all the way
>>>>>>>>>>>>>>>>>> down?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've tried the set of wildcards as below and I seem to be
>>>>>>>>>>>>>>>>>> running into a lot of cycles, where various subsite folders are appended to
>>>>>>>>>>>>>>>>>> each other and an extraction of data at all of those locations is
>>>>>>>>>>>>>>>>>> attempted. Ideally I'd like to avoid having to construct an exact set of
>>>>>>>>>>>>>>>>>> paths because the set may change, especially with new content being added.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Path rules:
>>>>>>>>>>>>>>>>>> /* file include
>>>>>>>>>>>>>>>>>> /* library include
>>>>>>>>>>>>>>>>>> /* list include
>>>>>>>>>>>>>>>>>> /* site include
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Metadata:
>>>>>>>>>>>>>>>>>> /* include true
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'd also like to pull down any files attached to list
>>>>>>>>>>>>>>>>>> items. I'm hoping that some type of "/* file include" should do it, once I
>>>>>>>>>>>>>>>>>> figure out how to safely include all content.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
