manifoldcf-user mailing list archives

From Dmitry Goldenberg <dgoldenb...@kmwllc.com>
Subject Re: Getting a 401 Unauthorized on a SharePoint 2010 crawl request, with MCPermissions.asmx installed
Date Mon, 16 Sep 2013 22:49:50 GMT
Karl,

>> subsite discovery is effectively disabled except directly under the root site

Yes. Come to think of it, I once came across this problem while
implementing a SharePoint connector.  I'm not sure whether it's exactly
what's happening with the issue we're discussing, but it looks like it.

I started off by using multiple getWebCollection calls to get the child
subsites of each site and navigating down that way. The problem was that
getWebCollection always returned the immediate subsites of the root site,
regardless of whether the call was made at the root or further down, so I
ended up generating infinite loops.

I switched over to using a single getAllSubWebCollection call and caching
its results. That call returns the full list of all subsites as pairs of
Title and Url.  I had a POJO similar to the one below which held that list
of sites and contained the logic for enumerating the child sites of a given
(parent) site URL.  From what I recall, getWebCollection behaves
inconsistently, either across SharePoint versions or across installations,
but the logic below should work in either case.

*** public class SubSiteCollection -- holds a list of CrawledSite POJOs,
each of which is a { title, url } pair.
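
*** For reference, a rough sketch of the CrawledSite POJO (the field and
method names here are from memory and only illustrative):

  // Simple immutable holder for one (Title, Url) pair returned by
  // getAllSubWebCollection.  Uses java.net.URL, hence the toString() call
  // in SubSiteCollection below.
  public class CrawledSite {
    private final String title;
    private final java.net.URL url;

    public CrawledSite(String title, java.net.URL url) {
      this.title = title;
      this.url = url;
    }

    public String getTitle() { return title; }
    public java.net.URL getUrl() { return url; }
  }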

*** SubSiteCollection has the following:

  // Returns only the immediate children of the given site, based on the
  // cached full subsite list.
  public List<CrawledSite> getImmediateSubSites(String siteUrl) {
    List<CrawledSite> subSites = new ArrayList<CrawledSite>();
    for (CrawledSite site : sites) {
      if (isChildOf(siteUrl, site.getUrl().toString())) {
        subSites.add(site);
      }
    }
    return subSites;
  }

  // A URL is an immediate child if it starts with the parent URL and has
  // exactly one path segment (one slash) left over after that prefix.
  private static boolean isChildOf(String parentUrl, String urlToCheck) {
    final String parent = normalizeUrl(parentUrl);
    final String child = normalizeUrl(urlToCheck);
    boolean ret = false;
    if (child.startsWith(parent)) {
      String remainder = child.substring(parent.length());
      // count how many slashes remain beyond the parent prefix
      ret = StringUtils.countOccurrencesOf(remainder, SLASH) == 1;
    }
    return ret;
  }

  // Lower-case and ensure a trailing slash so prefix comparisons are consistent.
  private static String normalizeUrl(String url) {
    return ((url.endsWith(SLASH)) ? url : url + SLASH).toLowerCase();
  }
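
*** And roughly how it got used, as a sketch -- populate the collection once
from the cached getAllSubWebCollection results, then walk the hierarchy by
only ever asking for immediate children.  TitleUrlPair, cachedSubWebResults,
and the add() method are hypothetical names here, just to illustrate the
shape; SLASH is simply the "/" constant:

  private static final String SLASH = "/";

  // Populate once per crawl from the (Title, Url) pairs that
  // getAllSubWebCollection returned and that we cached.
  // (URL construction exceptions omitted in this sketch.)
  SubSiteCollection allSites = new SubSiteCollection();
  for (TitleUrlPair pair : cachedSubWebResults) {
    allSites.add(new CrawledSite(pair.getTitle(), new java.net.URL(pair.getUrl())));
  }

  // Then recurse from the root using only immediate children, which avoids
  // the infinite loops we saw with repeated getWebCollection calls.
  void crawlSite(SubSiteCollection allSites, String siteUrl) {
    // ... process the site itself here ...
    for (CrawledSite child : allSites.getImmediateSubSites(siteUrl)) {
      crawlSite(allSites, child.getUrl().toString());
    }
  }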

- Dmitry



On Mon, Sep 16, 2013 at 2:54 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Dmitry,
>
> Have a look at this sequence also:
>
> >>>>>>
> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite
> list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Abcd', 'Abcd'
> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite
> list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Defghij', 'Defghij'
> DEBUG 2013-09-16 11:43:56,817 (Worker thread '8') - SharePoint: Subsite
> list: 'http://ec2-99-99-99-99.compute-1.amazonaws.com/Klmnopqr',
> 'Klmnopqr'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking
> whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd' exactly matched rule path '/*'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including
> site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Abcd'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking
> whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij' exactly matched rule path '/*'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including
> site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Defghij'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Checking
> whether to include site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Site
> '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr' exactly matched rule path '/*'
> DEBUG 2013-09-16 11:43:56,818 (Worker thread '8') - SharePoint: Including
> site '/Klmnopqr/Abcd/Abcd/Klmnopqr/Klmnopqr'
>
> <<<<<<
>
> This is using the GetSites(String parent) method with a site name of
> "/Klmnopqr/Abcd/Abcd/Klmnopqr", and getting back three sites (!!).  The
> parent path is not correct, obviously, but nevertheless this is one way in
> which paths are getting completely messed up.  It *looks* like the Webs web
> service is broken in such a way as to ignore the URL coming in, except for
> the base part, which means that subsite discovery is effectively disabled
> except directly under the root site.
>
> This might still be OK if it is not possible to create subsites of
> subsites in this version of SharePoint.  Can you confirm that this is or is
> not possible?
>
> Karl
>
>
>
> On Mon, Sep 16, 2013 at 2:42 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> "This is everything that got generated, from the very beginning"
>>
>> Well, something isn't right.  What I expect to see right up front, but
>> don't, are:
>>
>> - A webs "getWebCollection" invocation for /_vti_bin/webs.asmx
>> - Two lists "getListCollection" invocations for /_vti_bin/lists.asmx
>>
>> Instead, the first transactions I see are from already-busted URLs - which
>> makes no sense, since they should not have been able to get queued yet.
>>
>> So there are a number of possibilities.  First, maybe the log isn't
>> getting cleared out, and the session in question therefore starts somewhere
>> in the middle of manifoldcf.log.1.  But no:
>>
>> >>>>>>
>> C:\logs>grep "POST /_vti_bin/webs" manifoldcf.log.1
>> grep: input lines truncated - result questionable
>> <<<<<<
>>
>> Nevertheless there are some interesting points here.  First, note the
>> following response, which I've been able to determine is against "Test
>> Library 1":
>>
>> >>>>>>
>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>> getListItems xml response: '<GetListItems xmlns="
>> http://schemas.microsoft.com/sharepoint/soap/directory/"><GetListItemsResponse
>> xmlns=""><GetListItemsResult
>> FileRef="SitePages/Home.aspx"/></GetListItemsResponse></GetListItems>'
>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: Checking
>> whether to include document '/SitePages/Home.aspx'
>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint: File
>> '/SitePages/Home.aspx' exactly matched rule path '/*'
>> DEBUG 2013-09-16 13:02:31,590 (Worker thread '23') - SharePoint:
>> Including file '/SitePages/Home.aspx'
>>  WARN 2013-09-16 13:02:31,590 (Worker thread '23') - Sharepoint:
>> Unexpected relPath structure; path is '/SitePages/Home.aspx', but expected
>> <list/library> length of 26
>> <<<<<<
>>
>> The FileRef in this case is pointing at what, exactly?  Is there a
>> SitePages/Home.aspx in the "Test Library 1" library?  Or does it mean to
>> refer back to the root site with this URL construction?  And since this is
>> supposedly at the root level, how come the combined site + library name
>> comes out to 26??  I get 15, which leaves 11 characters unaccounted for.
>>
>> I'm still looking at the logs to see if I can glean key information.
>> Later, if I could set up a crawl against the sharepoint instance in
>> question, that would certainly help.  I can readily set up an ssh tunnel if
>> that is what is required.  But I won't be able to do it until I get home
>> tonight.
>>
>> Karl
>>
>>
>>
>> On Mon, Sep 16, 2013 at 1:58 PM, Dmitry Goldenberg <
>> dgoldenberg@kmwllc.com> wrote:
>>
>>> Karl,
>>>
>>> This is everything that got generated, from the very beginning, meaning
>>> that I did a fresh build, new database, and new connection definitions, and
>>> then started. The log must have rolled, but the .1 log is included.
>>>
>>> If I were to get you access to the actual test system, would you mind
>>> taking a look? It may be more efficient than sending logs..
>>>
>>> - Dmitry
>>>
>>>
>>> On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> These logs are different but have exactly the same problem; they start
>>>> in the middle when the crawl is already well underway.  I'm wondering if by
>>>> chance you have more than one agents process running or something?  Or
>>>> maybe the log is rolling and stuff is getting lost?  What's there is not
>>>> what I would expect to see, at all.
>>>>
>>>> I *did* manage to find two transactions that look like they might be
>>>> helpful, but because the *results* of those transactions are required by
>>>> transactions that take place minutes *before* in the log, I have no
>>>> confidence that I'm looking at anything meaningful.  But I'll get back to
>>>> you on what I find nonetheless.
>>>>
>>>> If you decide to repeat this exercise, try watching the log with "tail -f"
>>>> before starting the job.  You should not see any log contents at all until
>>>> the job is started.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <
>>>> dgoldenberg@kmwllc.com> wrote:
>>>>
>>>>> Karl,
>>>>>
>>>>> Attached please find logs which start at the beginning. I started from
>>>>> a fresh build (clean db, etc.); the logs start at server start, then I
>>>>> create the output connection and the repo connection, then the job, and
>>>>> then I fire off the job. I aborted the execution about a minute or so into
>>>>> it.  That's all that's in the logs with:
>>>>>
>>>>> org.apache.manifoldcf.connectors=DEBUG
>>>>>
>>>>> log4j.logger.httpclient.wire.header=DEBUG
>>>>> log4j.logger.org.apache.commons.httpclient=DEBUG
>>>>>
>>>>> - Dmitry
>>>>>
>>>>>
>>>>> On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> Are you sure these are the right logs?
>>>>>> - They start right in the middle of a crawl
>>>>>> - They are already in a broken state when they start, e.g. the kinds
>>>>>> of things that are being looked up are already nonsense paths
>>>>>>
>>>>>> I need to see logs from the BEGINNING of a fresh crawl to see how the
>>>>>> nonsense paths happen.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <
>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>
>>>>>>> Karl,
>>>>>>>
>>>>>>> I've generated logs with details as we discussed.
>>>>>>>
>>>>>>> The job was created afresh, as before:
>>>>>>> Path rules:
>>>>>>> /* file include
>>>>>>> /* library include
>>>>>>> /* list include
>>>>>>> /* site include
>>>>>>> Metadata:
>>>>>>> /* include true
>>>>>>> The logs are attached.
>>>>>>> - Dmitry
>>>>>>>
>>>>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>
>>>>>>>> "Do you think that this issue is generic with regard to any Amz
>>>>>>>> instance?"
>>>>>>>>
>>>>>>>> I presume so, since you didn't apparently do anything special to set
>>>>>>>> one of these up.  Unfortunately, such instances are not part of the
>>>>>>>> free tier, so I am still constrained from setting one up for myself
>>>>>>>> because of household rules here.
>>>>>>>>
>>>>>>>> "For now, I assume our only workaround is to list the paths of
>>>>>>>> interest manually"
>>>>>>>>
>>>>>>>> Depending on what is going wrong, that may not even work.  For this
>>>>>>>> to happen, it looks like several SharePoint web service calls may be
>>>>>>>> affected, and not in a cleanly predictable way.
>>>>>>>>
>>>>>>>> "is identification and extraction of attachments supported in the
>>>>>>>> SP connector?"
>>>>>>>>
>>>>>>>> ManifoldCF in general leaves identification and extraction to the
>>>>>>>> search engine.  Solr, for instance, uses Tika for this, if so
>>>>>>>> configured.  You can configure your Solr output connection to include
>>>>>>>> or exclude specific mime types or extensions if you want to limit
>>>>>>>> what is attempted.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <
>>>>>>>> dgoldenberg@kmwllc.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Karl. Do you think that this issue is generic with regard
>>>>>>>>> to any Amz instance? I'm just wondering how easily reproducible this
>>>>>>>>> may be..
>>>>>>>>>
>>>>>>>>> For now, I assume our only workaround is to list the paths of
>>>>>>>>> interest manually, i.e. add explicit rules for each library and list.
>>>>>>>>>
>>>>>>>>> A related subject - is identification and extraction of attachments
>>>>>>>>> supported in the SP connector?  E.g. if I have a Word doc attached
>>>>>>>>> to a Task list item, would that be extracted?  So far, I see that
>>>>>>>>> library content gets crawled and I'm getting the list item data but
>>>>>>>>> am not sure what happens to the attachments.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>
>>>>>>>>>> Thanks for the additional information.  It does appear like the
>>>>>>>>>> method that lists subsites is not working as expected under AWS.
>>>>>>>>>> Nor are some number of other methods which supposedly just list the
>>>>>>>>>> children of a subsite.
>>>>>>>>>>
>>>>>>>>>> I've reopened CONNECTORS-772 to work on addressing this issue.
>>>>>>>>>> Please stay tuned.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>
>>>>>>>>>>> Most of the paths that get generated are listed in the attached
>>>>>>>>>>> log; they match what shows up in the diag report.  So I'm not sure
>>>>>>>>>>> where they diverge; most of them just don't seem right.  There are
>>>>>>>>>>> 3 subsites rooted in the main site: Abcd, Defghij, Klmnopqr.  It's
>>>>>>>>>>> strange that the connector would try such paths as:
>>>>>>>>>>>
>>>>>>>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there are
>>>>>>>>>>> multiple repetitions of the same subsite on the path, and to begin
>>>>>>>>>>> with, Defghij is not a subsite of Klmnopqr, so why would it try
>>>>>>>>>>> this? The /// at the end doesn't seem correct either, unless I'm
>>>>>>>>>>> missing something in how this pathing works.
>>>>>>>>>>>
>>>>>>>>>>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements
>>>>>>>>>>> -- looks wrong. A docname is mixed into the path, and a subsite
>>>>>>>>>>> ends up after a docname?...
>>>>>>>>>>>
>>>>>>>>>>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/
>>>>>>>>>>> -- same types of issues, plus now somehow the docname got split
>>>>>>>>>>> with a forward slash?..
>>>>>>>>>>>
>>>>>>>>>>> There are also a bunch of StringIndexOutOfBoundsException's.
>>>>>>>>>>> Perhaps this logic doesn't fit with the pathing we're seeing on
>>>>>>>>>>> this amz-based installation?
>>>>>>>>>>>
>>>>>>>>>>> I'd expect the logic to just know that root contains 3 subsites,
>>>>>>>>>>> and work off that. Each subsite has a specific list of libraries
>>>>>>>>>>> and lists, etc. It seems odd that the connector gets into this
>>>>>>>>>>> matching pattern, and tries what looks like thousands of
>>>>>>>>>>> variations (I aborted the execution).
>>>>>>>>>>>
>>>>>>>>>>> - Dmitry
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>
>>>>>>>>>>>> To clarify, the way you would need to analyze this is to run a
>>>>>>>>>>>> crawl with the wildcards as you have selected, abort if necessary
>>>>>>>>>>>> after a while, and then use the Document Status report to list
>>>>>>>>>>>> the document identifiers that had been generated.  Find a
>>>>>>>>>>>> document identifier that you believe represents a path that is
>>>>>>>>>>>> illegal, and figure out what SOAP getChild call caused the
>>>>>>>>>>>> problem by returning incorrect data.  In other words, find the
>>>>>>>>>>>> point in the path where the path diverges from what exists into
>>>>>>>>>>>> what doesn't exist, and go back in the ManifoldCF logs to find
>>>>>>>>>>>> the particular SOAP request that led to the issue.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd expect from your description that the problem lies with
>>>>>>>>>>>> getting child sites given a site path, but that's just a guess at
>>>>>>>>>>>> this point.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't understand what you mean by "I've tried the set of
>>>>>>>>>>>>> wildcards as below and I seem to be running into a lot of
>>>>>>>>>>>>> cycles, where various subsite folders are appended to each other
>>>>>>>>>>>>> and an extraction of data at all of those locations is
>>>>>>>>>>>>> attempted".  If you are seeing cycles it means that document
>>>>>>>>>>>>> discovery is still failing in some way.  For each
>>>>>>>>>>>>> folder/library/site/subsite, only the children of that
>>>>>>>>>>>>> folder/library/site/subsite should be appended to the path -
>>>>>>>>>>>>> ever.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you can give a specific example, preferably including the
>>>>>>>>>>>>> soap back-and-forth, that would be very helpful.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <dgoldenberg@kmwllc.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Quick question. Is there an easy way to configure an SP repo
>>>>>>>>>>>>>> connection for crawling of all content, from the root site all
>>>>>>>>>>>>>> the way down?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've tried the set of wildcards as below and I seem to be
>>>>>>>>>>>>>> running into a lot of cycles, where various subsite folders are
>>>>>>>>>>>>>> appended to each other and an extraction of data at all of
>>>>>>>>>>>>>> those locations is attempted. Ideally I'd like to avoid having
>>>>>>>>>>>>>> to construct an exact set of paths because the set may change,
>>>>>>>>>>>>>> especially with new content being added.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Path rules:
>>>>>>>>>>>>>> /* file include
>>>>>>>>>>>>>> /* library include
>>>>>>>>>>>>>> /* list include
>>>>>>>>>>>>>> /* site include
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Metadata:
>>>>>>>>>>>>>> /* include true
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd also like to pull down any files attached to list items.
>>>>>>>>>>>>>> I'm hoping that some type of "/* file include" should do it,
>>>>>>>>>>>>>> once I figure out how to safely include all content.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> - Dmitry
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
