manifoldcf-user mailing list archives

From julien.massi...@francelabs.com
Subject Re: Delete IDs with JDBC connector
Date Thu, 27 Apr 2017 10:32:21 GMT
Hi Karl, 

yes, your fix works. However, doesn't it break the logic of the delta
feature provided by the seeding query, which makes good use of the
$(STARTTIME) and $(ENDTIME) variables? 

For example, let's assume that the docs in my database have a timestamp
that indicates their last modification date. If I set the following
'Seeding query': 

SELECT doc.id AS "$(IDCOLUMN)"
FROM doctable doc
WHERE doc.lastmod > $(STARTTIME) 

the advantage is that the first crawl will retrieve all my docs from the
database and the next ones will only retrieve those that are new or have
been modified since the last crawl.

Now if I combine that with a 'Version check query', each execution of
the job will also check the version of all the crawled docs since the
very first crawl, and delete those that have disappeared from the
database. 

I think that with your modification, this logic is completely broken:
during a 'delta' crawl, all the docs that were previously crawled but
do not appear in the delta will be deleted, even though they may
still be present in the database.
I would just change your fix to only apply the 'seenDocuments'
condition when the $(STARTTIME) and $(ENDTIME) variables are not present
in the 'Seeding query' and the 'Version check query' is empty. 
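To illustrate, the guard I have in mind would be something like the following sketch (the class, field, and method names are illustrative, not the actual ManifoldCF JDBCConnector internals):

```java
// Sketch of the proposed guard: only treat unseen documents as deleted
// when the seeding query returns the complete document set (no delta
// window) and no version check query is configured.
public class DeltaGuard {
  static boolean mayDeleteUnseen(String seedingQuery, String versionQuery) {
    // A seeding query referencing the delta window only returns a
    // subset of the documents, so "unseen" does not mean "gone".
    boolean seedingUsesDelta = seedingQuery.contains("$(STARTTIME)")
        || seedingQuery.contains("$(ENDTIME)");
    boolean hasVersionQuery = versionQuery != null && !versionQuery.isEmpty();
    return !seedingUsesDelta && !hasVersionQuery;
  }

  public static void main(String[] args) {
    String deltaSeed = "SELECT doc.id AS \"$(IDCOLUMN)\" FROM doctable doc"
        + " WHERE doc.lastmod > $(STARTTIME)";
    String fullSeed = "SELECT doc.id AS \"$(IDCOLUMN)\" FROM doctable doc";
    System.out.println(mayDeleteUnseen(deltaSeed, "")); // delta crawl: no deletion
    System.out.println(mayDeleteUnseen(fullSeed, ""));  // full crawl: deletion is safe
  }
}
```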

What do you think? 

Anyway, thanks for your quick fix,
Julien 

On 26.04.2017 19:12, Karl Wright wrote:

> I committed a fix to trunk, and also uploaded a patch to the ticket.  Please let me know if it works for you. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 11:24 AM, <julien.massiera@francelabs.com> wrote:
> 
> Oh OK so I finally don't have to investigate :)
> 
> Thanks Karl ! 
> 
> Julien
> 
> On 26.04.2017 17:20, Karl Wright wrote: 
> Oh, never mind.  I see the issue, which is that without the version query, documents that don't appear in the result list *at all* are never removed from the map.  I'll create a ticket. 
> 
> Karl 
> 
> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <daddywri@gmail.com> wrote:
> 
> Hi Julien, 
> 
> The delete logic in the connector is as follows: 
> 
>>>>>>> 
> 
> // Now, go through the original id's, and see which ones are still in the map.  These
> // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
> for (String documentIdentifier : documentIdentifiers)
> {
>   if (fetchDocuments.contains(documentIdentifier))
>   {
>     String documentVersion = map.get(documentIdentifier);
>     if (documentVersion != null)
>     {
>       // This means we did not see it (or data for it) in the result set.  Delete it!
>       activities.noDocument(documentIdentifier,documentVersion);
>       activities.recordActivity(null, ACTIVITY_FETCH,
>         null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
>     }
>   }
> }
> <<<<<< 
> 
> For a JDBC job without a version query, fetchDocuments contains all the documents.  But map has the entries removed that were actually fetched.  Documents that were *not* fetched for whatever reason therefore will not be cleaned up.  Here's the code that determines that:

> 
>>>>>>> 
> 
> String version = map.get(id);
> if (version == null)
>   // Does not need refetching
>   continue;
> 
> // This document was marked as "not scan only", so we expect to find it.
> if (Logging.connectors.isDebugEnabled())
>   Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
> o = row.getValue(JDBCConstants.urlReturnColumnName);
> if (o == null)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>   errorCode = activities.NULL_URL;
>   errorDesc = "Excluded because document had a null URL";
>   activities.noDocument(id,version);
>   continue;
> }
> 
> // This is not right - url can apparently be a BinaryInput
> String url = JDBCConnection.readAsString(o);
> boolean validURL;
> try
> {
>   // Check to be sure url is valid
>   new java.net.URI(url);
>   validURL = true;
> }
> catch (java.net.URISyntaxException e)
> {
>   validURL = false;
> }
> 
> if (!validURL)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>   errorCode = activities.BAD_URL;
>   errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>   activities.noDocument(id,version);
>   continue;
> }
> 
> // Process the document itself
> Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
> // Null data is allowed; we just ignore these
> if (contents == null)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>   errorCode = "NULLDATA";
>   errorDesc = "Excluded because document had null data";
>   activities.noDocument(id,version);
>   continue;
> }
> 
> // We will ingest something, so remove this id from the map in order that we know what we still
> // need to delete when all done.
> map.remove(id);
> <<<<<< 
> 
> As you see, activities.noDocument() is called in all cases, except the one where the document version is null (which cannot happen since all document versions for this case will be the empty string).  So I am at a loss to understand why the delete is not happening. 
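The bookkeeping above can be reduced to a miniature sketch using plain collections (the names below are illustrative, not the actual connector classes):

```java
import java.util.*;

// Miniature sketch of the delete bookkeeping: seeded ids start in a
// version map, fetched ids are removed from it, and whatever remains
// is presumed gone from the database.
public class DeleteLogicSketch {
  static List<String> unseenIds(Map<String,String> seeded, Set<String> fetched) {
    // Every document actually returned by the data query is removed
    // from the map as it is ingested...
    Map<String,String> map = new HashMap<>(seeded);
    for (String id : fetched)
      map.remove(id);
    // ...so the leftovers were not seen and should be deleted.
    List<String> toDelete = new ArrayList<>(map.keySet());
    Collections.sort(toDelete);
    return toDelete;
  }

  public static void main(String[] args) {
    Map<String,String> seeded = new HashMap<>();
    seeded.put("doc1", ""); seeded.put("doc2", ""); seeded.put("doc3", "");
    // Suppose the data query only returned doc1 and doc2:
    System.out.println(unseenIds(seeded, Set.of("doc1", "doc2"))); // [doc3]
  }
}
```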
> 
> The only way I can think of is that you clicked one of the buttons on the output connection's view page that told MCF to "forget" all the history for that connection. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massiera@francelabs.com> wrote:
> 
> Hi Karl, 
> 
> I was manually starting the job for test purposes, but even if I schedule it with job invocation "Complete" and "Scan every document once", the missing IDs from the database are not deleted from my Solr index (no trace of any 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and 'Data query', and I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query. 
> 
> Julien
> 
> On 26.04.2017 16:05, Karl Wright wrote: 
> Hi Julien, 
> 
> How are you starting the job?  If you use "Start minimal", deletion would not take place.  If your job is a continuous one, this is also the case. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massiera@francelabs.com> wrote:
> Hi the MCF community,
> 
> I am using MCF 2.6 with the JDBC connector to crawl an Oracle database and index the data into a Solr server, and it works very well. However, when I perform a delta re-crawl, the new IDs are correctly retrieved from the database, but those that have been deleted are not detected by the connector and thus are still present in my Solr index.
> I would like to know whether this should normally work and I have perhaps missed something in the configuration of the job, or whether this is simply not implemented.
> The only way I found to solve this issue is to reset the seeding of the job, but that is very time and resource consuming.
> 
> Best regards,
> Julien Massiera