I committed a fix to trunk, and also uploaded a patch to the ticket.  Please let me know if it works for you.

Thanks,
Karl


On Wed, Apr 26, 2017 at 11:24 AM, <julien.massiera@francelabs.com> wrote:

Oh OK, so I don't have to investigate after all :)

Thanks, Karl!

Julien

On 26.04.2017 17:20, Karl Wright wrote:

Oh, never mind.  I see the issue, which is that without the version query, documents that don't appear in the result list *at all* are never removed from the map.  I'll create a ticket.
 
Karl
 

On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Julien,
 
The delete logic in the connector is as follows:
 
>>>>>>
    // Now, go through the original id's, and see which ones are still in the map.  These
    // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
    for (String documentIdentifier : documentIdentifiers)
    {
      if (fetchDocuments.contains(documentIdentifier))
      {
        String documentVersion = map.get(documentIdentifier);
        if (documentVersion != null)
        {
          // This means we did not see it (or data for it) in the result set.  Delete it!
          activities.noDocument(documentIdentifier,documentVersion);
          activities.recordActivity(null, ACTIVITY_FETCH,
            null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
        }
      }
    }
<<<<<<
 
For a JDBC job without a version query, fetchDocuments contains all the documents, but map has had the entries removed for the documents that were actually fetched.  Documents that were *not* fetched, for whatever reason, therefore will not be cleaned up.  Here's the code that determines that:
 
>>>>>>
            String version = map.get(id);
            if (version == null)
              // Does not need refetching
              continue;
 
            // This document was marked as "not scan only", so we expect to find it.
            if (Logging.connectors.isDebugEnabled())
              Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
            o = row.getValue(JDBCConstants.urlReturnColumnName);
            if (o == null)
            {
              Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
              errorCode = activities.NULL_URL;
              errorDesc = "Excluded because document had a null URL";
              activities.noDocument(id,version);
              continue;
            }
            
            // This is not right - url can apparently be a BinaryInput
            String url = JDBCConnection.readAsString(o);
            boolean validURL;
            try
            {
              // Check to be sure url is valid
              new java.net.URI(url);
              validURL = true;
            }
            catch (java.net.URISyntaxException e)
            {
              validURL = false;
            }
 
            if (!validURL)
            {
              Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
              errorCode = activities.BAD_URL;
              errorDesc = "Excluded because document had illegal URL ('"+url+"')";
              activities.noDocument(id,version);
              continue;
            }
            
            // Process the document itself
            Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
            // Null data is allowed; we just ignore these
            if (contents == null)
            {
              Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
              errorCode = "NULLDATA";
              errorDesc = "Excluded because document had null data";
              activities.noDocument(id,version);
              continue;
            }
            
            // We will ingest something, so remove this id from the map in order that we know what we still
            // need to delete when all done.
            map.remove(id);
<<<<<<
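
For illustration only, the bookkeeping that the two excerpts above rely on can be sketched in isolation: every requested id starts out in the map, ids that are going to be ingested are removed, and whatever is left at the end is treated as gone from the database.  This is a minimal sketch under stated assumptions, not connector code: DeletionSketch, findDeletions and idsInResultSet are hypothetical names, and the empty-string versions mirror the "no version query" case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, NOT ManifoldCF code: models the
// "remove what was seen, delete what remains" bookkeeping.
final class DeletionSketch {
  private DeletionSketch() {}

  // documentIdentifiers: the ids the framework asked us to process.
  // versions: id -> version string ("" when there is no version query).
  // idsInResultSet: ids that actually came back from the data query.
  static List<String> findDeletions(List<String> documentIdentifiers,
                                    Map<String, String> versions,
                                    Set<String> idsInResultSet) {
    // Work on a copy so the caller's map is left untouched.
    Map<String, String> map = new HashMap<>(versions);

    // Anything we saw (and would ingest) is removed from the map.
    for (String id : idsInResultSet) {
      map.remove(id);
    }

    // Whatever is still in the map was never seen: presume it is gone.
    List<String> toDelete = new ArrayList<>();
    for (String id : documentIdentifiers) {
      if (map.get(id) != null) {
        toDelete.add(id);
      }
    }
    return toDelete;
  }
}
```

The sketch only works if every requested id is present in the map with a non-null version before the result set is read; an id that is missing from the map at that point can never reach the delete branch.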
 
As you can see, activities.noDocument() is called in all cases except the one where the document version is null (which cannot happen, since all document versions in this case will be the empty string).  So I am at a loss to understand why the delete is not happening.
 
The only explanation I can think of is that you clicked one of the buttons on the output connection's view page that tells MCF to "forget" all the history for that connection.
 
Thanks,
Karl
 
 

On Wed, Apr 26, 2017 at 10:42 AM, <julien.massiera@francelabs.com> wrote:

Hi Karl,

I was starting the job manually for testing purposes, but even if I schedule it with job invocation "Complete" and "Scan every document once", the IDs that are missing from the database are not deleted from my Solr index (there is no trace of any 'document deletion' event in the history).
I should mention that I only use the 'Seeding query' and 'Data query' and I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query.

Julien

On 26.04.2017 16:05, Karl Wright wrote:

Hi Julien,
 
How are you starting the job?  If you use "Start minimal", deletion would not take place.  If your job is a continuous one, this is also the case.
 
Thanks,
Karl

On Wed, Apr 26, 2017 at 9:52 AM, <julien.massiera@francelabs.com> wrote:
Hi MCF community,

I am using MCF 2.6 with the JDBC connector to crawl an Oracle database and index the data into a Solr server, and it works very well. However, when I perform a delta re-crawl, the new IDs are correctly retrieved from the database, but those that have been deleted are not "detected" by the connector and thus are still present in my Solr index.
I would like to know whether this should normally work and I have missed something in the job configuration, or whether it is simply not implemented.
The only way I have found to solve this issue is to reset the seeding of the job, but that is very time- and resource-consuming.

Best regards,
Julien Massiera