manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: Delete IDs with JDBC connector
Date: Wed, 26 Apr 2017 15:23:31 GMT
CONNECTORS-1419.
Karl

On Wed, Apr 26, 2017 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:

> Oh, never mind.  I see the issue, which is that without the version query,
> documents that don't appear in the result list *at all* are never removed
> from the map.  I'll create a ticket.
>
> Karl
>
>
> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Julien,
>>
>> The delete logic in the connector is as follows:
>>
>> >>>>>>
>>     // Now, go through the original ids, and see which ones are still in the map.
>>     // These did not appear in the result and are presumed to be gone from the
>>     // database, and thus must be deleted.
>>     for (String documentIdentifier : documentIdentifiers)
>>     {
>>       if (fetchDocuments.contains(documentIdentifier))
>>       {
>>         String documentVersion = map.get(documentIdentifier);
>>         if (documentVersion != null)
>>         {
>>           // This means we did not see it (or data for it) in the result set.  Delete it!
>>           activities.noDocument(documentIdentifier,documentVersion);
>>           activities.recordActivity(null, ACTIVITY_FETCH,
>>             null, documentIdentifier, "NOTFETCHED",
>>             "Document was not seen by processing query", null);
>>         }
>>       }
>>     }
>> <<<<<<
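>>
>> To make the pattern concrete, here is a minimal standalone sketch of the same
>> mark-and-sweep idea (the class and the data are hypothetical, not the actual
>> connector code): every candidate id starts out in the map, ids seen in the
>> result set are removed, and whatever is left over is treated as deleted.
>>
>> >>>>>>
>> import java.util.*;
>>
>> public class DeleteSweepSketch
>> {
>>   public static void main(String[] args)
>>   {
>>     // All candidate ids start in the map, with empty versions as when
>>     // no version query is configured
>>     List<String> documentIdentifiers = Arrays.asList("1", "2", "3");
>>     Map<String,String> map = new HashMap<>();
>>     for (String id : documentIdentifiers)
>>       map.put(id, "");
>>
>>     // Ids actually seen in the data query result; "2" has vanished from
>>     // the database, so its entry is never removed
>>     for (String id : Arrays.asList("1", "3"))
>>       map.remove(id);
>>
>>     // Sweep: anything still in the map was not seen, and would get
>>     // activities.noDocument() in the real connector
>>     for (String id : documentIdentifiers)
>>       if (map.get(id) != null)
>>         System.out.println("Would delete document '" + id + "'");
>>   }
>> }
>> <<<<<<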
>>
>> For a JDBC job without a version query, fetchDocuments contains all the
>> documents, while map has had the entries removed for the documents that were
>> actually fetched.  Documents that were *not* fetched, for whatever reason,
>> therefore remain in the map and should be swept up by the delete pass above.
>> Here's the code that determines that:
>>
>> >>>>>>
>>             String version = map.get(id);
>>             if (version == null)
>>               // Does not need refetching
>>               continue;
>>
>>             // This document was marked as "not scan only", so we expect to find it.
>>             if (Logging.connectors.isDebugEnabled())
>>               Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
>>             o = row.getValue(JDBCConstants.urlReturnColumnName);
>>             if (o == null)
>>             {
>>               Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>>               errorCode = activities.NULL_URL;
>>               errorDesc = "Excluded because document had a null URL";
>>               activities.noDocument(id,version);
>>               continue;
>>             }
>>
>>             // This is not right - url can apparently be a BinaryInput
>>             String url = JDBCConnection.readAsString(o);
>>             boolean validURL;
>>             try
>>             {
>>               // Check to be sure url is valid
>>               new java.net.URI(url);
>>               validURL = true;
>>             }
>>             catch (java.net.URISyntaxException e)
>>             {
>>               validURL = false;
>>             }
>>
>>             if (!validURL)
>>             {
>>               Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>>               errorCode = activities.BAD_URL;
>>               errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>>               activities.noDocument(id,version);
>>               continue;
>>             }
>>
>>             // Process the document itself
>>             Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
>>             // Null data is allowed; we just ignore these
>>             if (contents == null)
>>             {
>>               Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>>               errorCode = "NULLDATA";
>>               errorDesc = "Excluded because document had null data";
>>               activities.noDocument(id,version);
>>               continue;
>>             }
>>
>>             // We will ingest something, so remove this id from the map so that
>>             // we know what we still need to delete when all done.
>>             map.remove(id);
>> <<<<<<
>>
>> As you can see, activities.noDocument() is called in every case except the one
>> where the document version is null (which cannot happen here, since in this
>> case all document versions will be the empty string).  So I am at a loss to
>> understand why the delete is not happening.
>>
>> The only explanation I can think of is that you clicked one of the buttons on
>> the output connection's view page that tells MCF to "forget" all the history
>> for that connection.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massiera@francelabs.com> wrote:
>>
>>> Hi Karl,
>>>
>>> I was starting the job manually for testing purposes, but even when I
>>> schedule it with job invocation "Complete" and "Scan every document once",
>>> the IDs missing from the database are not deleted from my Solr index (there
>>> is no trace of any 'document deletion' event in the history).
>>> I should mention that I only use the 'Seeding query' and the 'Data query',
>>> and I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
>>> query.
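>>>
>>> (For reference, a delta-style seeding query using those variables would look
>>> something like this, with purely illustrative table and column names:
>>> SELECT id AS $(IDCOLUMN) FROM mytable
>>>   WHERE lastmodified BETWEEN $(STARTTIME) AND $(ENDTIME) )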
>>>
>>> Julien
>>>
>>> On 26.04.2017 at 16:05, Karl Wright wrote:
>>>
>>> Hi Julien,
>>>
>>> How are you starting the job?  If you use "Start minimal", deletion will not
>>> take place.  The same is true if your job is a continuous one.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massiera@francelabs.com> wrote:
>>>
>>>> Hi MCF community,
>>>>
>>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database and
>>>> index the data into a Solr server, and it works very well.  However, when I
>>>> perform a delta re-crawl, the new IDs are correctly retrieved from the
>>>> database, but the IDs that have been deleted are not "detected" by the
>>>> connector and are thus still present in my Solr index.
>>>> I would like to know whether this should normally work and I have perhaps
>>>> missed something in the job configuration, or whether it is simply not
>>>> implemented.
>>>> The only way I have found to work around this issue is to reset the seeding
>>>> of the job, but that is very time- and resource-consuming.
>>>>
>>>> Best regards,
>>>> Julien Massiera
>>>
>>>
>>>
>>
>
