manifoldcf-dev mailing list archives

From Julien Massiera <julien.massi...@francelabs.com>
Subject How documents are deleted
Date Wed, 24 Oct 2018 09:26:08 GMT
Hi Karl,

I am trying to understand how ManifoldCF behaves during a re-crawl, 
and especially how missing documents are deleted and by which process.

I am focusing on two repository connectors, the JCIFS one and the JDBC 
one. Here is what I understand so far:

In the JCIFS connector, the addSeedDocuments method lists all the files 
found for each configured path. So it seems clear that any previously 
crawled files that have not been listed during a re-crawl by this method 
should be deleted.
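
To make this assumption concrete, here is a minimal sketch in plain Java. 
It is not ManifoldCF code: the class name, the method name and the sample 
paths are all hypothetical, and it only illustrates the idea that a 
document known from the previous crawl but absent from the new listing 
becomes a candidate for deletion.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not ManifoldCF code.
class FullListingSketch {

    // Returns the identifiers that were crawled before but do not appear
    // in the current full listing, i.e. the deletion candidates.
    static Set<String> findMissing(Set<String> previouslyCrawled, Set<String> listedNow) {
        Set<String> missing = new TreeSet<>(previouslyCrawled);
        missing.removeAll(listedNow);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> previous = new TreeSet<>(
            Arrays.asList("/share/a.txt", "/share/b.txt", "/share/c.txt"));
        Set<String> current = new TreeSet<>(
            Arrays.asList("/share/a.txt", "/share/c.txt"));

        // prints [/share/b.txt]
        System.out.println(findMissing(previous, current));
    }
}
```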

In the JDBC connector, the addSeedDocuments method lists only the new or 
modified documents during a re-crawl (provided, of course, that the id 
query correctly uses the starttime and endtime variables). So the two 
connectors differ here: to delete missing documents, the previously 
crawled ones need to be 'checked' with the version query to detect 
the documents that must be removed.
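
As an illustration of this second assumption, here is a minimal sketch in 
plain Java. Again, it is not ManifoldCF code and every name in it is 
hypothetical: the versionLookup function stands in for the version query, 
and the sketch assumes that a document whose lookup returns null no longer 
exists in the source and must be removed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical illustration only; this is not ManifoldCF code.
class VersionCheckSketch {

    // Re-checks every previously crawled identifier. An identifier for
    // which the lookup yields no version (null) is treated as deleted
    // from the source.
    static Set<String> findMissing(Set<String> previouslyCrawled,
                                   Function<String, String> versionLookup) {
        Set<String> missing = new TreeSet<>();
        for (String id : previouslyCrawled) {
            if (versionLookup.apply(id) == null) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for the database table: row "2" has been deleted.
        Map<String, String> table = new HashMap<>();
        table.put("1", "v2");
        table.put("3", "v1");

        Set<String> previous = new TreeSet<>(Arrays.asList("1", "2", "3"));

        // prints [2]
        System.out.println(findMissing(previous, table::get));
    }
}
```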

I am currently unable to tell what ManifoldCF actually does to handle 
documents that must be deleted, or whether any of the assumptions I 
stated above are correct and/or used. I would also really like to know 
which part of the code performs the deletion.

Thanks for your help.

-- 
Julien MASSIERA
Product Development Director
France Labs – The Search Experts
Join us at the Enterprise Search & Discovery Summit in Washington DC
www.francelabs.com

