manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: new documents
Date Tue, 12 Feb 2013 17:02:05 GMT
The crawl it does depends on the model the connector uses.  If the
connector does not get informed about deletes, then the connector is
*forced* to find them by checking each document to see if it went
away.  Most repositories cannot report deletions, I'm afraid.
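
As a rough illustration of that full-scan model, here is a minimal sketch in Python.  It is a toy only: `repo` and `index` are plain dicts standing in for a repository and a search index, and `full_scan_crawl` is a hypothetical name, not a ManifoldCF API.  It shows why every document must be rechecked (a cheap version comparison) even though only changed ones are reindexed, and why deletes can only be found by testing each indexed document.

```python
# Toy model only: `repo` and `index` are plain dicts, and none of these
# names come from ManifoldCF.

def full_scan_crawl(repo, index):
    """Recheck every document; reindex changed ones; remove vanished ones.

    repo:  {doc_id: version_string} as the repository reports it now
    index: {doc_id: version_string} as recorded by the previous crawl
    Returns a dict of counters describing the work done.
    """
    stats = {"checked": 0, "reindexed": 0, "deleted": 0}

    # Rechecking: one cheap version comparison per repository document.
    for doc_id, version in repo.items():
        stats["checked"] += 1
        if index.get(doc_id) != version:
            index[doc_id] = version  # reindexing: the expensive step
            stats["reindexed"] += 1

    # Delete detection: with no delete notifications, each indexed
    # document has to be tested against the repository.
    for doc_id in list(index):
        if doc_id not in repo:
            del index[doc_id]
            stats["deleted"] += 1

    return stats
```

In this sketch the cost of checking grows with the total number of documents, while the cost of reindexing grows only with the number that changed.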

There are some ways you can get around this.  The best is by using a
continuous crawl with expiration.  ManifoldCF will then just do some
of the crawl during each time window it is given and never try to
clean up dead documents, other than by expiring them.  You can read
more about the various crawl models in ManifoldCF in Action.
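
A minimal sketch of the expiration idea, again in Python and again a toy.  The function name, the `budget` and `expiration` parameters, and the "stalest first" ordering are all assumptions made for illustration; ManifoldCF's real scheduling is more involved.  The point is only that a document that has disappeared from the repository is never refreshed, so it eventually ages out without any explicit delete check.

```python
# Toy model only: every name and the scheduling policy here are
# illustrative assumptions, not ManifoldCF behavior.

def run_window(repo, index, now, budget, expiration):
    """Process one crawl time window.

    repo:       {doc_id: version_string}
    index:      {doc_id: {"version": str, "fetched": number}}
    now:        current time, in the same units as `expiration`
    budget:     maximum number of documents to fetch in this window
    expiration: age since last fetch at which an entry is dropped
    Returns the list of doc ids expired in this window.
    """
    # 1. Expire entries that have not been refreshed recently enough.
    expired = [d for d, rec in index.items()
               if now - rec["fetched"] >= expiration]
    for d in expired:
        del index[d]

    # 2. Partial crawl: refresh at most `budget` documents, stalest first.
    #    Documents not in the index sort first.
    def last_fetched(d):
        rec = index.get(d)
        return rec["fetched"] if rec else float("-inf")

    for d in sorted(repo, key=last_fetched)[:budget]:
        index[d] = {"version": repo[d], "fetched": now}

    return expired
```

If the budget per window is too small relative to the expiration interval, live documents would also expire in this model before being refreshed, so the two settings need to be chosen together.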

Karl


On Tue, Feb 12, 2013 at 11:55 AM, Mark Lugert <mlugert@yahoo.com> wrote:
> If you have 3 million documents, then each time you run a crawl it will check
> each document that matches your query, correct?
>
> Just want to make sure I understand.  That could really take a lot of time.
>
> Wouldn't it be better to store a last crawled date and then limit the query
> based on that date so you're only indexing things the repo server says have
> changed?  The current method seems better suited to things like
> websites/wikis where you can't really query based on modified dates.
>
> -mark
>
> From: Karl Wright <daddywri@gmail.com>
> To: user@manifoldcf.apache.org; Mark Lugert <mlugert@yahoo.com>
> Sent: Monday, February 11, 2013 5:10 PM
> Subject: Re: new documents
>
> Actually, it doesn't reindex everything.  It only reindexes those
> documents that have "changed", using the connector's idea of what that
> means.  For SharePoint, it's the modify date; for Alfresco and CMIS I
> don't know, but others on this list might.
>
> Also, don't confuse rechecking with reindexing.  ManifoldCF *will*
> need to scan through the documents in many cases, but it will do a
> minimal amount of work for each one.
>
> Karl
>
> On Mon, Feb 11, 2013 at 3:35 PM, Mark Lugert <mlugert@yahoo.com> wrote:
>> Hi Karl,
>>
>> If I use the SharePoint, Alfresco, or CMIS repo connectors, how can I make
>> it only index new documents that match my queries?
>>
>> Right now I'm seeing it reindex everything that matches my query every
>> time the job runs.
>>
>> I have it set to scan all documents once, but it still rescans everything
>> every time I start the job.  Is this a config issue on my part?
>>
>> thanks,
>> Mark
>
>
