Subject: Re: Crawling all of a SharePoint site
From: Karl Wright <daddywri@gmail.com>
To: user@manifoldcf.apache.org
Date: Mon, 18 Nov 2013 20:39:01 -0500

Hi Mark,

Is "Cache Profiles" a list in your SharePoint? If not, what is it?

Karl


On Mon, Nov 18, 2013 at 8:37 PM, Mark Libucha wrote:

> Hi Karl,
>
> It's not the first problem you mentioned. I don't have a site specified in
> my SP connection. But it could well be the misconfigured IIS issue...
>
> Here's what I get with your modified log message:
>
> ERROR 2013-11-18 20:35:47,440 (Worker thread '7') - Exception tossed:
> Expected path to start with /Lists/, saw: '/Cache Profiles/1_.000'
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path
> to start with /Lists/, saw: '/Cache Profiles/1_.000'
>
> Thanks,
>
> Mark
>
>
> On Mon, Nov 18, 2013 at 5:29 PM, Karl Wright wrote:
>
>> Hi Mark,
>>
>> The exception is very helpful.
>>
>> I've seen this before. I know of two ways it can happen.
>>
>> First way: your Repository Connection is not actually pointing at the
>> SharePoint root, but rather at a subsite of the root. That usually messes
>> things up pretty well, and it's not easy to detect properly in the
>> connector either. You must point at the actual root, not a subsite, and
>> use the criteria to limit what you include.
>>
>> Second way: your SharePoint instance has a misconfigured IIS, which is
>> mapping paths in unexpected ways.
>>
>> There may be other ways this can happen; SharePoint has myriad
>> configuration options, and it is possible your instance has one we've
>> never seen before. If you think that is what is happening, change this
>> line:
>>
>>     throw new ManifoldCFException("Expected path to start with /Lists/");
>>
>> to:
>>
>>     throw new ManifoldCFException("Expected path to start with /Lists/, saw: '"+relPath+"'");
>>
>> Karl
>>
>>
>> On Mon, Nov 18, 2013 at 8:20 PM, Mark Libucha wrote:
>>
>>> Screen shot attached. Using 4.1, SharePoint 2010.
>>>
>>> Throws this exception:
>>>
>>> ERROR 2013-11-18 20:12:58,058 (Worker thread '13') - Exception tossed:
>>> Expected path to start with /Lists/
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path
>>> to start with /Lists/
>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository$ListItemStream.addFile(SharePointRepository.java:2255)
>>>
>>> I added a debug log message to the SharePoint crawler, so the line number
>>> may be off by 1 or 2...
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
>>> On Mon, Nov 18, 2013 at 4:59 PM, Karl Wright wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> First, what version of ManifoldCF are you using? 1.3 has some bugs
>>>> where lists are concerned.
>>>>
>>>> Second, I've recently and repeatedly run exactly this crawl against a
>>>> site that one of our ManifoldCF users set up in Amazon, so I know it
>>>> works properly. So now the question is to determine exactly what you
>>>> are doing that is not correct.
>>>>
>>>> If you want to crawl just lists, you will nevertheless need to enter
>>>> both a Site match and a List match. Otherwise you will get nothing,
>>>> because no sites can be crawled.
>>>>
>>>> To enter ANY of the rules I specified above, type a "*" in the type-in
>>>> box, then select "Add Text". Then select one of "File", "Site", "List",
>>>> or "Library" from the pulldown, and click the "Add new Rule" button.
>>>> The Metadata tab works similarly.
>>>>
>>>> If you want me to verify you have done this correctly, please include
>>>> a screen shot of the job's View page.
>>>>
>>>> If this still isn't helping, please include a screen shot of the
>>>> Simple History report after you have run a crawl.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Mon, Nov 18, 2013 at 7:49 PM, Mark Libucha wrote:
>>>>
>>>>> I've seen this issue come up before, but I'd like to hear more about
>>>>> it (Karl), if there is more to say about it...
>>>>>
>>>>> Why isn't there an option to crawl an entire SharePoint site? I mean,
>>>>> it's awesome that the UI gives us the option of drilling down
>>>>> dynamically and specifying exactly which parts we want crawled, but
>>>>> isn't the default case for most users to just crawl the whole thing?
>>>>>
>>>>> So, why exactly is this not an option, and would adding that
>>>>> functionality (I would volunteer to try this) be feasible?
>>>>>
>>>>> On a more specific level, Karl wrote this in an earlier thread:
>>>>>
>>>>> For SharePoint, if you want to crawl everything beneath your root
>>>>> site, the simplest way is to define 4 rules:
>>>>> (1) SITE rule "/*"
>>>>> (2) LIST rule "/*"
>>>>> (3) LIBRARY rule "/*"
>>>>> (4) FILE rule "/*"
>>>>>
>>>>> I haven't been able to get this to work. It only seems to get files.
>>>>>
>>>>> Limiting the scope to just Lists: when I use "/*" and specify List, I
>>>>> get nothing crawled. I also tried "/Lists/*". Still nothing.
>>>>>
>>>>> Maybe I'm not specifying the Metadata correctly? Could you expand on
>>>>> this, Karl? What exactly needs to be specified to crawl all Lists? If
>>>>> I can get that to work, I can probably figure out the rest of it.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
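
[Editor's note: the diagnostic change Karl suggests can be illustrated with a self-contained sketch. This is hypothetical code, not the connector's actual source: `listRelativePath` is an invented stand-in for the check inside `ListItemStream.addFile`, and `IllegalStateException` stands in for `ManifoldCFException` so the snippet compiles on its own. The point is only that including the offending `relPath` in the exception message makes the misconfiguration visible in the log, as Mark's second stack trace shows.]

```java
public class PathCheck {
    // Hypothetical stand-in for the connector's check: list-item
    // paths are expected to begin with "/Lists/". Including the
    // actual path in the message is what turns an opaque failure
    // into a diagnosable one (e.g. '/Cache Profiles/1_.000').
    static String listRelativePath(String relPath) {
        if (!relPath.startsWith("/Lists/")) {
            throw new IllegalStateException(
                "Expected path to start with /Lists/, saw: '" + relPath + "'");
        }
        // Strip the "/Lists/" prefix to get the list-relative part.
        return relPath.substring("/Lists/".length());
    }

    public static void main(String[] args) {
        System.out.println(listRelativePath("/Lists/Announcements/1_.000"));
        try {
            listRelativePath("/Cache Profiles/1_.000");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```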