From: "Silvia, Daniel [USA]"
To: Karl Wright
CC: "connectors-user@incubator.apache.org"
Subject: RE: Web Crawl using ManifoldCF
Date: Wed, 8 Feb 2012 14:11:03 +0000

Thanks Karl

________________________________________
From: Karl Wright [daddywri@gmail.com]
Sent: Wednesday, February 08, 2012 8:40 AM
To: Silvia, Daniel [USA]
Cc: connectors-user@incubator.apache.org
Subject: Re: Web Crawl using ManifoldCF

On Wed, Feb 8, 2012 at 8:24 AM, Silvia, Daniel [USA] wrote:
> Hi Karl,
>
> I want to thank you for your help regarding the SharePoint to Solr
> connection; everything seems to be working properly after getting the
> Viewers and Home Owners groups' permissions set properly by our
> SharePoint admins.

That's great news! Thanks for sticking with it. ;-)
> However, I have another question, this time about pulling site content
> from the SharePoint instance rather than the files stored on the
> SharePoint instance.
>
> When creating a repository connection, would you use the "Web" connection
> type to pull site content? If that is the case, when creating the job, do
> you indicate just the site URL you want to crawl in the "Seed" tab? Are
> we using the correct repository connection? Is there a repository type we
> can use to just crawl websites for the content and not the files?

I think that's the right approach, if there's a document you can crawl
somewhere that has a reference to the other documents, or the documents
all refer to each other. You need such a document or documents at the root
of a document web; otherwise a web crawler has no way of locating the
documents in question. That would be how you identify your "seed"
document. For typical (non-SharePoint) sites, that's usually the main URL
of the site. So, for example, if you wanted to crawl cnn.com you'd
probably use a seed of http://www.cnn.com, because that's a good place to
start to get to all of CNN's content.

If no such document(s) exist, then web crawling is not going to do it. If
this "site" is served by SharePoint, then some kind of enhancement to the
SharePoint connector would be a better approach.

Thanks,
Karl

> As you can see, I hope I have explained myself properly; we are just
> trying to crawl site content.
>
> Thanks
>
> Dan
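To make the "seed" idea above concrete, here is a minimal, hypothetical
sketch of what a link-following crawler does with a seed URL. This is not
ManifoldCF code: the real Web connector also handles robots.txt,
throttling, URL canonicalization, and recrawl scheduling on top of this.
The seed URL and the page limit below are made up purely for illustration.

# Toy breadth-first crawler: every page it can index must be reachable by
# following links from the seed, which is why a web-crawl job needs a seed
# document (or documents) at the root of the document web.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, limit=25):
    """Fetch pages breadth-first, starting from the seed URL."""
    seen = {seed}
    queue = deque([seed])
    while queue and len(seen) <= limit:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # unreachable page or non-HTTP link: skip it
        print("fetched:", url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)


if __name__ == "__main__":
    # Hypothetical seed URL; in ManifoldCF this is what you would enter on
    # the job's seeds tab.
    crawl("http://www.example.com/")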