manifoldcf-user mailing list archives

From "Silvia, Daniel [USA]" <Silvia_Dan...@bah.com>
Subject RE: Web Crawl using ManifoldCF
Date Wed, 08 Feb 2012 14:11:03 GMT
Thanks Karl

________________________________________
From: Karl Wright [daddywri@gmail.com]
Sent: Wednesday, February 08, 2012 8:40 AM
To: Silvia, Daniel [USA]
Cc: connectors-user@incubator.apache.org
Subject: Re: Web Crawl using ManifoldCF

On Wed, Feb 8, 2012 at 8:24 AM, Silvia, Daniel [USA]
<Silvia_Daniel@bah.com> wrote:
> Hi Karl
>
>
>
> I want to thank you for your help regarding the SharePoint to Solr
> connections. Everything seems to be working properly now that our
> SharePoint Admins have set the permissions for the Viewers and Home
> Owners groups properly.

That's great news!  Thanks for sticking with it. ;-)

> However, I have another question regarding pulling site content from
> the SharePoint instance and not the files stored on the SharePoint instance.
>
>
>
> When creating a Repository connection, would you use the "Web" connection
> type to pull site content? If that is the case, when creating the job, do
> you indicate just the site URL you want to crawl to pull site content in the
> "Seed" tab? Are we using the correct repository connection type? Is there a
> repository type we use to just crawl websites for the content and not
> files?
>
>

I think that's the right approach, if there's a document you can crawl
somewhere that has a reference to the other documents, or the
documents all refer to each other.  You need such a document or
documents at the root of a document web, otherwise a web crawler has
no way of locating the documents in question.  That would be how you
identify your "seed" document.  For typical (non-SharePoint) sites,
that's usually the main URL of the site.  So, for example, if you
wanted to crawl cnn.com you'd probably use a seed of
http://www.cnn.com, because that's a good place to start to get to all
of cnn's content.

If no such document(s) exist, then web crawling is not going to do it.
If this "site" is served by SharePoint, then some kind of enhancement
to the SharePoint connector would be a better approach.
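
To make the reachability point above concrete, here is a minimal sketch in
Python. It is not ManifoldCF code: the function name reachable_from_seeds,
the example.com URLs, and the in-memory link graph are all hypothetical,
chosen only to show that a link-following crawler finds exactly the pages
that can be reached from its seeds.

```python
from collections import deque


def reachable_from_seeds(links, seeds):
    """Return the set of pages a link-following crawler could discover.

    links: dict mapping a page URL to the list of URLs it links to
           (a hypothetical stand-in for fetching and parsing each page).
    seeds: iterable of starting URLs.
    """
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        page = queue.popleft()
        # Follow every outgoing link that has not been visited yet.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: the root links to /a and /b, but nothing links
# to /orphan, so no crawl starting at the root can ever find it.
links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
    "http://example.com/orphan": ["http://example.com/a"],
}

found = reachable_from_seeds(links, ["http://example.com/"])
```

In this sketch the root, /a, and /b end up in found, while /orphan does
not, because no reachable page refers to it. That is the case where web
crawling alone is not enough and the page would have to be added as a seed
of its own or reached some other way.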

Thanks,
Karl

>
> I hope I have explained myself properly. As you can see, we are just
> trying to crawl site content.
>
>
>
> Thanks
>
>
>
> Dan