From: "Karl Wright (JIRA)"
To: dev@manifoldcf.apache.org
Date: Fri, 15 Aug 2014 17:57:18 +0000 (UTC)
Subject: [jira] [Commented] (CONNECTORS-1009) Cmis Repository Connector does not handle Document updating properly

    [ https://issues.apache.org/jira/browse/CONNECTORS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098837#comment-14098837 ]

Karl Wright commented on CONNECTORS-1009:
-----------------------------------------

Hi Prasad,

Please read this entry:

https://chemistry.apache.org/java/0.9.0/maven/apidocs/org/apache/chemistry/opencmis/client/api/Session.html#query%28java.lang.String,%20boolean%29

Note that we call the session.query() method as follows:

{code}
ItemIterable<QueryResult> results = session.query(cmisQuery, false).getPage(1000000000);
{code}

Note the "false" second argument, which if I read this right *should* cause
the seed query to return only the latest versions. So, in theory, if you remove the document = document.getObjectOfLatestVersion() invocation, the connector should work.

Please also note that you can perform the full-crawl equivalent of continuous crawling by just setting up a set of schedule windows, and making sure you turn off the requirement that crawls only ever begin at the start of a window.

> Cmis Repository Connector does not handle Document updating properly
> --------------------------------------------------------------------
>
>                 Key: CONNECTORS-1009
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1009
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: CMIS connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Prasad Perera
>            Priority: Minor
>             Fix For: ManifoldCF 1.7
>
>         Attachments: std_logs.txt, std_prints.diff
>
>
> As part of the fix for CONNECTORS-1004, it seems CmisRepositoryConnector does not handle document updating properly.
> Case scenario:
> * Create a continuous crawling job using CmisRepositoryConnector.
> * Update a document on the repository end.
> * The document keeps being submitted to the OutputConnector at each crawling interval, though it was not updated afterwards.
> One possible fix is at CmisRepositoryConnector:processDocument,
> activities.ingestDocumentWithException(nodeId, version, documentURI, rd);
> The documentURI should point to the old document URI. (Now it points to the latest documentURI discovered, and that may confuse document references?)
> Also, in ECM systems, for example in Alfresco, the document IDs are formulated with the version number as well.
> Ex: workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0 --> version 1.0
>     workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1 --> version 1.1
> When we set up a query to crawl a repository folder, we discover content by referring to the child nodes.
> Because of that, it now seems to queue all the document versions and submit them to the OutputConnector, thus producing duplicate documents on the output (search) side.
> Is there a way to avoid this problem? It would be great if the repository could just take the latest document version and submit it as an update.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
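
The request in the quoted description (submit only the latest version of each document) can be pictured with a small standalone sketch. It is not connector code and calls no CMIS API: the class LatestVersionFilter and its methods are hypothetical names, and it assumes node IDs carry the version as a dotted numeric label after the last ';', as in the two Alfresco examples quoted above. Whether filtering on the ID like this is appropriate for a given repository is an open question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ManifoldCF or OpenCMIS.
// Assumes IDs look like "<base>;<major>.<minor>", as in the Alfresco examples.
class LatestVersionFilter {

    // Part of the ID before the last ';' (the whole ID if there is no ';').
    static String baseId(String nodeId) {
        int i = nodeId.lastIndexOf(';');
        return i < 0 ? nodeId : nodeId.substring(0, i);
    }

    // Part of the ID after the last ';' (empty if there is no ';').
    static String versionLabel(String nodeId) {
        int i = nodeId.lastIndexOf(';');
        return i < 0 ? "" : nodeId.substring(i + 1);
    }

    // Compares dotted numeric labels component by component, so "1.10" > "1.9".
    // Throws NumberFormatException if a component is not an integer.
    static int compareLabels(String a, String b) {
        String[] pa = a.isEmpty() ? new String[0] : a.split("\\.");
        String[] pb = b.isEmpty() ? new String[0] : b.split("\\.");
        for (int i = 0; i < Math.max(pa.length, pb.length); i++) {
            int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    // Maps each base ID to the discovered ID with the highest version label.
    static Map<String, String> latestPerDocument(List<String> nodeIds) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String id : nodeIds) {
            String base = baseId(id);
            String current = latest.get(base);
            if (current == null
                    || compareLabels(versionLabel(id), versionLabel(current)) > 0) {
                latest.put(base, id);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // The two versioned IDs quoted in the issue description.
        List<String> discovered = Arrays.asList(
            "workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0",
            "workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1");
        for (String id : latestPerDocument(discovered).values()) {
            System.out.println(id);
        }
    }
}
```

For the two IDs from the description, latestPerDocument keeps only the entry ending in ";1.1", keyed by the shared base ID. Keying on the base ID is also one way to give every version of a document the same identity on the output side, so that a later version replaces the earlier one instead of being indexed alongside it.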