manifoldcf-user mailing list archives

From Mark Libucha <mlibu...@gmail.com>
Subject Re: Crawling all of a SharePoint site
Date Tue, 19 Nov 2013 02:52:45 GMT
My svn (ubuntu 12.04) is old enough that it doesn't support patch. Had to
do it manually and had the URL for the patch wrong. Finally got that fixed,
and...

It's totally working now -- for Lists anyway. Thanks!

I'll do some more testing to make sure I'm getting everything else.

Thanks again, Karl.

Mark


On Mon, Nov 18, 2013 at 6:30 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Mark,
>
> The patch removed the exception toss entirely, so I don't think you
> applied it right.
>
> Can you do the following:
>
> cd trunk
> svn revert connectors/sharepoint/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharepoint/SharePointRepository.java
> svn patch CONNECTORS-812.patch
> ant clean build
>
> Thanks!
> Karl
>
>
>
> On Mon, Nov 18, 2013 at 9:27 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>
>> I *think* I applied the patch correctly. Got a new error:
>>
>> ERROR 2013-11-18 21:25:47,994 (Worker thread '1') - Exception tossed:
>> Expected path to start with /Lists/, saw: '/Relationships List/1_.000'
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path
>> to start with /Lists/, saw: '/Relationships List/1_.000'
>>
>> http://msdn.microsoft.com/en-us/library/ff798514.aspx
>>
>> Mark
>>
>>
>> On Mon, Nov 18, 2013 at 5:53 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Ok, patch attached.
>>>
>>> One of two things will happen with this patch:
>>> (1) It will work
>>> (2) It will crawl to completion but not get any list rows
>>>
>>> If it is the latter, it means that SharePoint operating in this mode
>>> REPLACES the list items with some funky cache URL, rather than augmenting
>>> them.  So please send me the log output if that happens.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>> On Mon, Nov 18, 2013 at 8:45 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hah.  Exactly the kind of configuration difference I was expecting.
>>>> Whatever it is, it's showing up as a list.
>>>>
>>>> I'll open a ticket, and propose a patch; let's see if that gets us past
>>>> this.
>>>>
>>>> The ticket is CONNECTORS-812.  I should have a patch in a few minutes,
>>>> attached to the ticket.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Nov 18, 2013 at 8:41 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>>
>>>>> Seems to be a SP-internal thing.
>>>>>
>>>>> http://msdn.microsoft.com/en-us/library/aa661294.ASPX
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On Mon, Nov 18, 2013 at 5:39 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Is "Cache Profiles" a list in your SharePoint?  If not, what is it?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 18, 2013 at 8:37 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> It's not the first problem you mentioned. I don't have a site
>>>>>>> specified in my SP connection. But it could well be the misconfigured
>>>>>>> IIS issue...
>>>>>>>
>>>>>>> Here's what I get with your modified log message:
>>>>>>>
>>>>>>> ERROR 2013-11-18 20:35:47,440 (Worker thread '7') - Exception
>>>>>>> tossed: Expected path to start with /Lists/, saw: '/Cache Profiles/1_.000'
>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected
>>>>>>> path to start with /Lists/, saw: '/Cache Profiles/1_.000'
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 18, 2013 at 5:29 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> The exception is very helpful.
>>>>>>>>
>>>>>>>> I've seen this before.  I know of two ways it can happen.
>>>>>>>>
>>>>>>>> First way: your Repository Connection is not actually pointing at
>>>>>>>> the SharePoint root, but rather at a subsite of the root.  That usually
>>>>>>>> messes things up pretty well - and it's not easy to detect properly in
>>>>>>>> the connector either.  You must point at the actual root, not a subsite,
>>>>>>>> and use the criteria to limit what you include.
>>>>>>>>
>>>>>>>> Second way: your SharePoint instance has a misconfigured IIS, which
>>>>>>>> is mapping paths in ways that are unexpected.
>>>>>>>>
>>>>>>>> There may be other ways that this can happen; SharePoint has
>>>>>>>> myriad configuration options, and it is possible your instance has
>>>>>>>> one that we've never seen before.  If you think that is what is
>>>>>>>> happening, change this line:
>>>>>>>>
>>>>>>>>             throw new ManifoldCFException("Expected path to start with /Lists/");
>>>>>>>>
>>>>>>>> to:
>>>>>>>>
>>>>>>>>             throw new ManifoldCFException("Expected path to start with /Lists/, saw: '"+relPath+"'");
>>>>>>>>
>>>>>>>> Karl
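
The message change suggested above can be tried in isolation. What follows is only a minimal sketch, not the connector's actual code: the class `ListPathCheck`, the method `stripListsPrefix`, and the exception `PathFormatException` (a stand-in for `ManifoldCFException`, so that nothing from ManifoldCF is needed on the classpath) are all hypothetical names, and the path `/Lists/Tasks/1_.000` is a made-up example. Only the message text and the `/Cache Profiles/1_.000` path come from this thread.

```java
class ListPathCheck {
    // Hypothetical stand-in for ManifoldCFException. It is unchecked here
    // purely to keep the sketch short.
    static class PathFormatException extends RuntimeException {
        PathFormatException(String message) {
            super(message);
        }
    }

    static final String LISTS_PREFIX = "/Lists/";

    // Returns the part of relPath after "/Lists/". If the prefix is missing,
    // the exception message includes the offending path so it shows up in the log.
    static String stripListsPrefix(String relPath) {
        if (!relPath.startsWith(LISTS_PREFIX)) {
            throw new PathFormatException(
                "Expected path to start with /Lists/, saw: '" + relPath + "'");
        }
        return relPath.substring(LISTS_PREFIX.length());
    }

    public static void main(String[] args) {
        // The first path is made up; the second is the one reported in this thread.
        String[] samples = {"/Lists/Tasks/1_.000", "/Cache Profiles/1_.000"};
        for (String path : samples) {
            try {
                System.out.println("ok: " + stripListsPrefix(path));
            } catch (PathFormatException e) {
                System.out.println("error: " + e.getMessage());
            }
        }
    }
}
```

Including the rejected value in the message is what makes a one-line log entry enough to tell which SharePoint path broke the assumption.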
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 18, 2013 at 8:20 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Screen shot attached. Using 4.1, SharePoint 2010.
>>>>>>>>>
>>>>>>>>> Throws this exception:
>>>>>>>>>
>>>>>>>>> ERROR 2013-11-18 20:12:58,058 (Worker thread '13') - Exception
>>>>>>>>> tossed: Expected path to start with /Lists/
>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>> Expected path to start with /Lists/
>>>>>>>>>     at
>>>>>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository$ListItemStream.addFile(SharePointRepository.java:2255)
>>>>>>>>>
>>>>>>>>> I added a debug log message to the SharePoint crawler so the line
>>>>>>>>> number may be off by 1 or 2...
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 18, 2013 at 4:59 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>>
>>>>>>>>>> First, what version of ManifoldCF are you using?  1.3 has some
>>>>>>>>>> bugs where lists are concerned.
>>>>>>>>>>
>>>>>>>>>> Second, I've recently and repeatedly run exactly this crawl
>>>>>>>>>> against a site that one of our ManifoldCF users set up in Amazon, so I
>>>>>>>>>> know it works properly.  So now the question is to determine exactly
>>>>>>>>>> what you are doing that is not correct.
>>>>>>>>>>
>>>>>>>>>> If you want to crawl just lists, you will nevertheless need to
>>>>>>>>>> enter both a Site match and a List match.  Otherwise you will get
>>>>>>>>>> nothing, because no sites can be crawled.
>>>>>>>>>>
>>>>>>>>>> To enter ANY of the rules I specified above, type a "*" in the
>>>>>>>>>> type-in box, then select "Add Text".  Then, select one of
>>>>>>>>>> "File", "Site", "List", or "Library" from the pulldown, and then click
>>>>>>>>>> the "Add new Rule" button.  The Metadata tab works similarly.
>>>>>>>>>>
>>>>>>>>>> If you want me to verify you have done this correctly, please
>>>>>>>>>> include a screen shot of the job's View page.
>>>>>>>>>>
>>>>>>>>>> If this still isn't helping you, please include a screen shot of
>>>>>>>>>> the Simple History report after you have run a crawl.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 18, 2013 at 7:49 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've seen this issue come up before, but I'd like to hear more
>>>>>>>>>>> about it (Karl), if there is more to say about it...
>>>>>>>>>>>
>>>>>>>>>>> Why isn't there an option to crawl an entire SharePoint site? I
>>>>>>>>>>> mean it's awesome that the UI gives us the option of drilling down
>>>>>>>>>>> dynamically and specifying exactly which parts we want crawled, but
>>>>>>>>>>> isn't the default case for most users to just crawl the whole thing?
>>>>>>>>>>>
>>>>>>>>>>> So, why exactly is this not an option, and would adding
>>>>>>>>>>> that functionality (I would be volunteering to try this) be feasible?
>>>>>>>>>>>
>>>>>>>>>>> On a more specific level, Karl wrote this in an earlier thread:
>>>>>>>>>>>
>>>>>>>>>>> <quote>
>>>>>>>>>>> For SharePoint, if you want to crawl everything beneath your
>>>>>>>>>>> root site, the simplest way is to define 4 rules:
>>>>>>>>>>> (1) SITE rule "/*"
>>>>>>>>>>> (2) LIST rule "/*"
>>>>>>>>>>> (3) LIBRARY rule "/*"
>>>>>>>>>>> (4) FILE rule "/*"
>>>>>>>>>>> </quote>
>>>>>>>>>>>
>>>>>>>>>>> I haven't been able to get this to work. It only seems to get
>>>>>>>>>>> files.
>>>>>>>>>>>
>>>>>>>>>>> Limiting the scope to just Lists, when I use "/*" and specify
>>>>>>>>>>> List, I get nothing crawled. Also tried "/Lists/*". Still nothing.
>>>>>>>>>>>
>>>>>>>>>>> Maybe I'm not specifying the Metadata correctly? Could you
>>>>>>>>>>> expand on this, Karl? What exactly needs to be specified to crawl all
>>>>>>>>>>> Lists? If I can get that to work I can probably figure out the rest of it.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
