From: Karl Wright
Date: Wed, 26 Apr 2017 11:23:31 -0400
Subject: Re: Delete IDs with JDBC connector
To: "user@manifoldcf.apache.org"
Reply-To: user@manifoldcf.apache.org
Mailing-List: contact user-help@manifoldcf.apache.org; run by ezmlm
archived-at: Wed, 26 Apr 2017 15:23:37 -0000

CONNECTORS-1419.

Karl

On Wed, Apr 26, 2017 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:

> Oh, never mind.  I see the issue, which is that without the version query,
> documents that don't appear in the result list *at all* are never removed
> from the map.  I'll create a ticket.
>
> Karl
>
>
> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Julien,
>>
>> The delete logic in the connector is as follows:
>>
>> >>>>>>
>>     // Now, go through the original id's, and see which ones are still in the map.  These
>>     // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
>>     for (String documentIdentifier : documentIdentifiers)
>>     {
>>       if (fetchDocuments.contains(documentIdentifier))
>>       {
>>         String documentVersion = map.get(documentIdentifier);
>>         if (documentVersion != null)
>>         {
>>           // This means we did not see it (or data for it) in the result set.  Delete it!
>>           activities.noDocument(documentIdentifier,documentVersion);
>>           activities.recordActivity(null, ACTIVITY_FETCH,
>>             null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
>>         }
>>       }
>>     }
>> <<<<<<
>>
>> For a JDBC job without a version query, fetchDocuments contains all the
>> documents.  But map has the entries removed that were actually fetched.
>> Documents that were *not* fetched for whatever reason therefore will not be
>> cleaned up.  Here's the code that determines that:
>>
>> >>>>>>
>>             String version = map.get(id);
>>             if (version == null)
>>               // Does not need refetching
>>               continue;
>>
>>             // This document was marked as "not scan only", so we expect to find it.
>>             if (Logging.connectors.isDebugEnabled())
>>               Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
>>             o = row.getValue(JDBCConstants.urlReturnColumnName);
>>             if (o == null)
>>             {
>>               Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>>               errorCode = activities.NULL_URL;
>>               errorDesc = "Excluded because document had a null URL";
>>               activities.noDocument(id,version);
>>               continue;
>>             }
>>
>>             // This is not right - url can apparently be a BinaryInput
>>             String url = JDBCConnection.readAsString(o);
>>             boolean validURL;
>>             try
>>             {
>>               // Check to be sure url is valid
>>               new java.net.URI(url);
>>               validURL = true;
>>             }
>>             catch (java.net.URISyntaxException e)
>>             {
>>               validURL = false;
>>             }
>>
>>             if (!validURL)
>>             {
>>               Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>>               errorCode = activities.BAD_URL;
>>               errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>>               activities.noDocument(id,version);
>>               continue;
>>             }
>>
>>             // Process the document itself
>>             Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
>>             // Null data is allowed; we just ignore these
>>             if (contents == null)
>>             {
>>               Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>>               errorCode = "NULLDATA";
>>               errorDesc = "Excluded because document had null data";
>>               activities.noDocument(id,version);
>>               continue;
>>             }
>>
>>             // We will ingest something, so remove this id from the map in order that we know what we still
>>             // need to delete when all done.
>>             map.remove(id);
>> <<<<<<
>>
>> As you see, activities.noDocument() is called for all cases, except the
>> one where the document version is null (which cannot happen since all
>> document versions for this case will be the empty string).  So I am at a
>> loss to understand why the delete is not happening.
>>
>> The only way I can think of is if you clicked one of the buttons on
>> the output connection's view page that told MCF to "forget" all the
>> history for that connection.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massiera@francelabs.com> wrote:
>>
>>> Hi Karl,
>>>
>>> I was starting the job manually for testing purposes, but even if I
>>> schedule it with job invocation "Complete" and "Scan every document
>>> once", the IDs missing from the database are not deleted from my Solr
>>> index (no trace of any 'document deletion' event in the history).
>>> I should mention that I only use the 'Seeding query' and 'Data query',
>>> and I am not using the $(STARTTIME) and $(ENDTIME) variables in my
>>> seeding query.
>>>
>>> Julien
>>>
>>> On 26.04.2017 16:05, Karl Wright wrote:
>>>
>>> Hi Julien,
>>>
>>> How are you starting the job?  If you use "Start minimal", deletion
>>> would not take place.  If your job is a continuous one, this is also
>>> the case.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massiera@francelabs.com> wrote:
>>>
>>>> Hi MCF community,
>>>>
>>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database
>>>> and index the data into a Solr server, and it works very well. However,
>>>> when I perform a delta re-crawl, the new IDs are correctly retrieved
>>>> from the Database, but those that have been deleted are not "detected"
>>>> by the connector and thus are still present in my Solr index.
>>>> I would like to know whether this should normally work and I have
>>>> perhaps missed something in the job configuration, or whether it is
>>>> not implemented.
>>>> The only way I found to solve this issue is to reset the seeding of
>>>> the job, but that is very time- and resource-consuming.
>>>>
>>>> Best regards,
>>>> Julien Massiera
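Editor's note: the bookkeeping described in the 11:10 AM message can be illustrated in isolation. What follows is only a minimal sketch under stated assumptions, not the connector's code: the class name `DeleteTrackingSketch` and the method `findDeletions` are hypothetical, and plain lists stand in for the requested document identifiers and for the rows returned by the data query. It shows the idea of seeding a map with every requested id (empty-string version, as when no version query is configured), removing each id that is ingested, and treating whatever remains as a deletion candidate. It does not attempt to reproduce the problem tracked as CONNECTORS-1419.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not the ManifoldCF JDBC connector code.
class DeleteTrackingSketch {

  // Returns the ids that were requested but never appeared in the result
  // rows, in the order they were requested.
  static List<String> findDeletions(List<String> requestedIds, List<String> resultIds) {
    // Assumption: with no version query, every requested id starts out
    // mapped to the empty string as its version.
    Map<String, String> map = new LinkedHashMap<>();
    for (String id : requestedIds) {
      map.put(id, "");
    }

    // Each id that comes back from the data query and is ingested is
    // removed, so the map only keeps what still needs attention.
    for (String id : resultIds) {
      map.remove(id);
    }

    // Whatever is left was not seen by the data query.
    return new ArrayList<>(map.keySet());
  }

  public static void main(String[] args) {
    List<String> requested = Arrays.asList("doc1", "doc2", "doc3");
    List<String> returned = Arrays.asList("doc1", "doc3");
    // doc2 was requested but not returned, so it is the only candidate.
    System.out.println(findDeletions(requested, returned));
  }
}
```

In the quoted connector code, the equivalent of the final step is the loop that calls activities.noDocument() for every identifier still present in the map.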