From: Karl Wright
To: Jitu
Cc: "user@manifoldcf.apache.org"
Date: Tue, 7 Oct 2014 10:20:53 -0400
Subject: Re: regarding crawl parameters

Hi Jitu,

I know of no way to crawl only those documents that were created after a
specified date. SharePoint crawling involves walking a tree, not querying
SharePoint for a list of documents that fulfill specific criteria.

What this means is that we will need to crawl the entire tree *regardless*
of what documents we decide to index. We can filter the discovered
documents by looking at their creation date, and exclude those last
modified prior to 2011-01-01 from being indexed. That would cut down on the
work that your index needs to do, and the work of actually fetching the
content itself. But we would still need to crawl all documents.

Karl

On Tue, Oct 7, 2014 at 10:11 AM, Jitu wrote:

> Hi Karl,
>
> Here is the requirement:
>
> One of our customers would like to selectively publish the documents from
> his SharePoint, which has become overgrown over time. Since filtering
> based on folder names is not an easy task, he would like us to crawl all
> the documents created in SharePoint between two dates.
>
> All documents created/modified between 2011-01-01 and 2013-12-31 need to
> be crawled, and if that is possible, then additional filters would be
> added to the date range. Ex: get only the Docx and Doc files created
> between 2011-01-01 and 2013-12-31, etc.
>
> Similarly, all documents created/modified in the last 2 months, etc.
>
> Thanks,
>
> Jitu
>
> On Mon, Oct 6, 2014 at 5:04 PM, Karl Wright wrote:
>
>> Hi Jitu,
>>
>> Did you ever figure out what the customer requirement really was here?
>>
>> Thanks,
>> Karl
>>
>> On Fri, Oct 3, 2014 at 6:09 PM, Karl Wright wrote:
>>
>>> Hi Jitu,
>>>
>>> SharePoint does not provide a way to crawl documents by date range, so
>>> all documents will need to be crawled regardless of any date range
>>> requirement, and then filtered.
>>>
>>> So at this point it is important to ask the client if their
>>> requirement's purpose is to save crawling load on the server, because
>>> if it is, you won't get much savings. But if the client wants this
>>> feature for other reasons, we can support it with some work.
>>>
>>> Please open a ticket if you find that the client has a legitimate
>>> reason for this requirement.
>>>
>>> Karl
>>>
>>> Sent from my Windows Phone
>>> ------------------------------
>>> From: Jitu
>>> Sent: 10/3/2014 4:22 PM
>>> To: user@manifoldcf.apache.org
>>> Subject: regarding crawl parameters
>>>
>>> Hi Karl,
>>>
>>> Thanks for your continuous support. We have a requirement from our
>>> client to crawl files which were created/modified in the last one or
>>> two months from a SharePoint server, and that parameter should be
>>> configurable in the GUI. We are using ManifoldCF version 1.7. Is there
>>> a way to achieve this? Please help.
>>>
>>> Thanks,
>>> Jitu
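
Editorial illustration of the point made in the thread: the whole tree has to be walked, and the date range (plus any file-type filter) only decides which of the discovered documents get indexed. This is a minimal sketch in plain Python, not ManifoldCF or SharePoint code. The `Node` class, `walk`, `select_for_indexing`, and the sample tree are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Iterator, List, Optional, Sequence, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a site, library, folder, or file."""
    name: str
    modified: Optional[date] = None  # set only on documents in this sketch
    children: List["Node"] = field(default_factory=list)

    @property
    def is_document(self) -> bool:
        return self.modified is not None


def walk(node: Node) -> Iterator[Node]:
    """Yield every document under node; every branch is always visited."""
    if node.is_document:
        yield node
    for child in node.children:
        yield from walk(child)


def select_for_indexing(
    root: Node,
    start: date,
    end: date,
    extensions: Optional[Sequence[str]] = None,
) -> Tuple[int, List[str]]:
    """Return (documents visited, names of documents that pass the filters)."""
    visited = 0
    selected: List[str] = []
    for doc in walk(root):
        visited += 1  # discovery cost is paid for every document
        if not (start <= doc.modified <= end):
            continue  # outside the inclusive date range: skip indexing
        if extensions and not doc.name.lower().endswith(tuple(extensions)):
            continue  # wrong file type: skip indexing
        selected.append(doc.name)
    return visited, selected


# Hypothetical sample tree, using the date range mentioned in the thread.
root = Node("site", children=[
    Node("Shared Documents", children=[
        Node("plan.docx", date(2012, 5, 1)),
        Node("old.doc", date(2009, 3, 15)),
        Node("archive", children=[
            Node("budget.xlsx", date(2013, 12, 31)),
            Node("notes.DOC", date(2011, 1, 1)),
        ]),
    ]),
    Node("recent.docx", date(2014, 9, 30)),
])

visited, selected = select_for_indexing(
    root, date(2011, 1, 1), date(2013, 12, 31), (".doc", ".docx")
)
print(visited, selected)
```

In this sketch the visit count is the same whatever filters are applied, which mirrors the point that filtering saves indexing and content-fetching work but not the crawl itself.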