manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: regarding crawl parameters
Date Fri, 10 Oct 2014 00:23:12 GMT
This code is now complete in both trunk and the dev_1x branch.

Karl


On Tue, Oct 7, 2014 at 11:18 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Jitu,
>
> I would suggest that we do not try for multiple date ranges, but just an
> "earliest document date" filtering parameter.  Adding this functionality to
> the Document Filter transformation connector would be what I'd do.  If
> necessary, we can also add an IOutputActivities method which will allow a
> connector to decide whether a document needs to be fetched or not based on
> its date stamp; this would help prevent unnecessary work opening older
> documents.
>
> Oddly enough, I think that the work involved would largely be in coming up
> with a reasonable date selection UI.
>
> If this sounds like it is what you want, please go ahead and create a
> ticket describing this functionality.
>
> Karl
>
>
> On Tue, Oct 7, 2014 at 10:35 AM, Jitu <abjitu@gmail.com> wrote:
>
>> Hi Karl,
>> Thanks for the support. What you said is exactly what we
>> are looking for too. Crawling everything is fine, but we should not process
>> the documents until the criteria are met. Here the criterion is that the file
>> was modified during the last 2 or 3 months, or within a date range.
>>
>> It is similar to getDocumentVersions, which checks whether the document
>> version has been updated and processes the file only if it has. So crawl
>> the documents, but don't process them unless the criteria match. Is there
>> a way to achieve this?
>>
>> Thanks,
>> Jitu
>>
>> On Tue, Oct 7, 2014 at 7:50 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Jitu,
>>>
>>> I know of no way to crawl only those documents that were created after a
>>> specified date.  SharePoint crawling involves walking a tree, not querying
>>> SharePoint for a list of documents that fulfill specific criteria.
>>>
>>> What this means is that we will need to crawl the entire tree
>>> *regardless* of what documents we decide to index.  We can filter the
>>> discovered documents by looking at their creation date, and exclude those
>>> last modified prior to 2011-01-01 from being indexed.  That would cut down
>>> on the work that your index needs to do, and the work of actually fetching
>>> the content itself.  But we would still need to crawl all documents.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Oct 7, 2014 at 10:11 AM, Jitu <abjitu@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Here is the requirement:
>>>>
>>>> One of our customers would like to selectively publish the documents
>>>> from his SharePoint, which has grown too large over time. Since
>>>> filtering based on folder names is not an easy task, he would like us to
>>>> crawl all the documents created in SharePoint between 2 dates.
>>>>
>>>>
>>>>
>>>> All documents created/modified between 2011-01-01 and 2013-12-31
>>>> need to be crawled, and if that is possible, additional filters would
>>>> then be added to the date range. Ex: get only the Docx and Doc files
>>>> created between 2011-01-01 and 2013-12-31, etc.
>>>>
>>>>
>>>> Similarly, all documents created/modified in the last 2 months, etc.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Jitu
>>>>
>>>> On Mon, Oct 6, 2014 at 5:04 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Jitu,
>>>>>
>>>>> Did you ever figure out what the customer requirement really was here?
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Oct 3, 2014 at 6:09 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Jitu,
>>>>>>
>>>>>> SharePoint does not provide a way to crawl documents by date range,
>>>>>> so all documents will need to be crawled regardless of any date range
>>>>>> requirement, and then filtered.
>>>>>>
>>>>>> So at this point it is important to ask the client if their
>>>>>> requirement's purpose is to save crawling load on the server, because
>>>>>> if it is, you won't get much savings.  But if the client wants this
>>>>>> feature for other reasons, we can support it with some work.
>>>>>>
>>>>>> Please open a ticket if you find that the client has a legitimate
>>>>>> reason for this requirement.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> Sent from my Windows Phone
>>>>>> ------------------------------
>>>>>> From: Jitu
>>>>>> Sent: 10/3/2014 4:22 PM
>>>>>> To: user@manifoldcf.apache.org
>>>>>> Subject: regarding crawl parameters
>>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> Thanks for your continuous support. We have a requirement from our
>>>>>> client to crawl files which were created/modified in the last one or 2
>>>>>> months from a SharePoint server, and that parameter should be
>>>>>> configurable in the GUI. We are using ManifoldCF version 1.7. Is there
>>>>>> a way to achieve this? Please help.
>>>>>>
>>>>>> Thanks,
>>>>>> Jitu
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
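
The approach discussed in this thread (walk the whole tree regardless, then index only documents at or after an "earliest document date") can be illustrated with a minimal Java sketch. This is not ManifoldCF code and does not use any ManifoldCF API: the Node class, the selectForIndexing method, and the sample names and dates are all hypothetical, chosen only to show why the crawl cost stays the same while the indexing work shrinks.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EarliestDateFilterSketch {

    /** Hypothetical tree node: a folder (has children) or a document (has a modified date). */
    static final class Node {
        final String name;
        final Instant lastModified; // null for folders
        final List<Node> children;

        Node(String name, Instant lastModified, List<Node> children) {
            this.name = name;
            this.lastModified = lastModified;
            this.children = children;
        }

        static Node folder(String name, Node... children) {
            return new Node(name, null, Arrays.asList(children));
        }

        static Node document(String name, String isoInstant) {
            return new Node(name, Instant.parse(isoInstant), Collections.<Node>emptyList());
        }

        boolean isDocument() {
            return lastModified != null;
        }
    }

    /** Visits every node, but returns only documents modified at or after the cutoff. */
    static List<String> selectForIndexing(Node root, Instant earliest) {
        List<String> selected = new ArrayList<>();
        walk(root, earliest, selected);
        return selected;
    }

    private static void walk(Node node, Instant earliest, List<String> selected) {
        if (node.isDocument()) {
            // The date check happens only after the document has been discovered.
            if (!node.lastModified.isBefore(earliest)) {
                selected.add(node.name);
            }
            return;
        }
        // Folders are always descended into; the tree cannot be queried by date.
        for (Node child : node.children) {
            walk(child, earliest, selected);
        }
    }

    public static void main(String[] args) {
        Node root = Node.folder("site",
            Node.document("old.doc", "2010-06-15T00:00:00Z"),
            Node.folder("reports",
                Node.document("q1.docx", "2012-03-01T00:00:00Z"),
                Node.document("q4.docx", "2013-12-31T00:00:00Z")));
        Instant earliest = Instant.parse("2011-01-01T00:00:00Z");
        System.out.println(selectForIndexing(root, earliest)); // prints [q1.docx, q4.docx]
    }
}
```

Every node is still visited, which matches the point made above that a date filter saves indexing and content-fetch work but not tree-walking work.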
