Subject: Re: Crawling all of a SharePoint site
From: Karl Wright <daddywri@gmail.com>
To: user@manifoldcf.apache.org
Date: Mon, 18 Nov 2013 20:39:01 -0500

Hi Mark,

Is "Cache Profiles" a list in your SharePoint? If not, what is it?

Karl


On Mon, Nov 18, 2013 at 8:37 PM, Mark Libucha wrote:

> Hi Karl,
>
> It's not the first problem you mentioned. I don't have a site specified in
> my SP connection. But it could well be the misconfigured IIS issue...
>
> Here's what I get with your modified log message:
>
> ERROR 2013-11-18 20:35:47,440 (Worker thread '7') - Exception tossed:
> Expected path to start with /Lists/, saw: '/Cache Profiles/1_.000'
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path
> to start with /Lists/, saw: '/Cache Profiles/1_.000'
>
> Thanks,
>
> Mark
>
>
> On Mon, Nov 18, 2013 at 5:29 PM, Karl Wright wrote:
>
>> Hi Mark,
>>
>> The exception is very helpful.
>>
>> I've seen this before. I know of two ways it can happen.
>>
>> First way: your Repository Connection is not actually pointing at the
>> SharePoint root, but rather at a subsite of the root. That usually messes
>> things up pretty well, and it's not easy to detect properly in the
>> connector either. You must point at the actual root, not a subsite, and
>> use the criteria to limit what you include.
>>
>> Second way: your SharePoint instance has a misconfigured IIS, which is
>> mapping paths in unexpected ways.
>>
>> There may be other ways this can happen; SharePoint has myriad
>> configuration options, and it is possible your instance has one we've
>> never seen before. If you think that is what is happening, change this
>> line:
>>
>>     throw new ManifoldCFException("Expected path to start with /Lists/");
>>
>> to:
>>
>>     throw new ManifoldCFException("Expected path to start with /Lists/, saw: '"+relPath+"'");
>>
>> Karl
>>
>>
>> On Mon, Nov 18, 2013 at 8:20 PM, Mark Libucha wrote:
>>
>>> Screen shot attached. Using 4.1, SharePoint 2010.
>>>
>>> Throws this exception:
>>>
>>> ERROR 2013-11-18 20:12:58,058 (Worker thread '13') - Exception tossed:
>>> Expected path to start with /Lists/
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path
>>> to start with /Lists/
>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository$ListItemStream.addFile(SharePointRepository.java:2255)
>>>
>>> I added a debug log message to the SharePoint crawler, so the line number
>>> may be off by 1 or 2...
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
>>> On Mon, Nov 18, 2013 at 4:59 PM, Karl Wright wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> First, what version of ManifoldCF are you using? 1.3 has some bugs
>>>> where lists are concerned.
>>>>
>>>> Second, I've recently and repeatedly run exactly this crawl against a
>>>> site that one of our ManifoldCF users set up in Amazon, so I know it
>>>> works properly. So now the question is to determine exactly what you
>>>> are doing that is not correct.
>>>>
>>>> If you want to crawl just lists, you will nevertheless need to enter
>>>> both a Site match and a List match. Otherwise you will get nothing,
>>>> because no sites can be crawled.
>>>>
>>>> To enter ANY of the rules I specified above, type a "*" in the type-in
>>>> box, then select "Add Text". Then select one of "File", "Site", "List",
>>>> or "Library" from the pulldown, and click the "Add new Rule" button.
>>>> The Metadata tab works similarly.
>>>>
>>>> If you want me to verify you have done this correctly, please include
>>>> a screen shot of the job's View page.
>>>>
>>>> If this still isn't helping, please include a screen shot of the
>>>> Simple History report after you have run a crawl.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Mon, Nov 18, 2013 at 7:49 PM, Mark Libucha wrote:
>>>>
>>>>> I've seen this issue come up before, but I'd like to hear more about
>>>>> it (Karl), if there is more to say about it...
>>>>>
>>>>> Why isn't there an option to crawl an entire SharePoint site? I mean,
>>>>> it's awesome that the UI gives us the option of drilling down
>>>>> dynamically and specifying exactly which parts we want crawled, but
>>>>> isn't the default case for most users to just crawl the whole thing?
>>>>>
>>>>> So, why exactly is this not an option, and would adding that
>>>>> functionality (I would volunteer to try this) be feasible?
>>>>>
>>>>> On a more specific level, Karl wrote this in an earlier thread:
>>>>>
>>>>> For SharePoint, if you want to crawl everything beneath your root
>>>>> site, the simplest way is to define 4 rules:
>>>>> (1) SITE rule "/*"
>>>>> (2) LIST rule "/*"
>>>>> (3) LIBRARY rule "/*"
>>>>> (4) FILE rule "/*"
>>>>>
>>>>> I haven't been able to get this to work. It only seems to get files.
>>>>>
>>>>> Limiting the scope to just Lists: when I use "/*" and specify List, I
>>>>> get nothing crawled. I also tried "/Lists/*". Still nothing.
>>>>>
>>>>> Maybe I'm not specifying the Metadata correctly? Could you expand on
>>>>> this, Karl? What exactly needs to be specified to crawl all Lists? If
>>>>> I can get that to work, I can probably figure out the rest of it.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
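
[Editor's note: the diagnostic change Karl suggests can be illustrated with a self-contained sketch. This is hypothetical code, not the connector's actual source: `listRelativePath` is an invented stand-in for the check inside `ListItemStream.addFile`, and `IllegalStateException` stands in for `ManifoldCFException` so the snippet compiles on its own. The point is only that including the offending `relPath` in the exception message makes the misconfiguration visible in the log, as Mark's second stack trace shows.]

```java
public class PathCheck {
    // Hypothetical stand-in for the connector's check: list-item
    // paths are expected to begin with "/Lists/". Including the
    // actual path in the message is what turns an opaque failure
    // into a diagnosable one (e.g. '/Cache Profiles/1_.000').
    static String listRelativePath(String relPath) {
        if (!relPath.startsWith("/Lists/")) {
            throw new IllegalStateException(
                "Expected path to start with /Lists/, saw: '" + relPath + "'");
        }
        // Strip the "/Lists/" prefix to get the list-relative part.
        return relPath.substring("/Lists/".length());
    }

    public static void main(String[] args) {
        System.out.println(listRelativePath("/Lists/Announcements/1_.000"));
        try {
            listRelativePath("/Cache Profiles/1_.000");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```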