Date: Tue, 9 Oct 2012 16:28:21 -0400
Subject: Re: Problem with continuous jobs deleting their documents on restart of Agent
From: Karl Wright
To: user@manifoldcf.apache.org

I've checked a fix for this into trunk. See the ticket for details.
Please try trunk in your environment to be sure everything looks good
for you now.

I think it may be a serious enough problem to force a 1.01 patch
release. Comments?

Karl

On Tue, Oct 9, 2012 at 9:36 AM, Karl Wright wrote:
> CONNECTORS-551
>
> FWIW, I will not be able to look at this for another few hours, most likely.
>
> Karl
>
>
> On Tue, Oct 9, 2012 at 9:31 AM, Karl Wright wrote:
>> Hi Martin,
>>
>> FWIW, the agents startup sequence also does not have logic which
>> deletes documents or jobs.
>>
>> Nevertheless I will create a ticket and have a look at this ASAP.
>>
>> Karl
>>
>> On Tue, Oct 9, 2012 at 9:25 AM, Martin Gielow wrote:
>>> I have just completed testing the behaviour on the unaltered
>>> multiprocess-example using the provided HSQL instance.
>>>
>>> Indeed, when using the file system connector, Manifold works as it should.
>>> The agent can be stopped and restarted and the previously processed
>>> documents are retained. When I tried the JDBC (pointed to a MySQL DB) and
>>> Wiki connectors, however, I received the same results as yesterday - all
>>> documents are deleted as soon as the agent restarts (not on shutdown but
>>> when running the agent again after it has been stopped).
>>>
>>> For the JDBC connector I could imagine that this may somehow be related to
>>> flawed seeding or version queries (although I believe them to be ok), but in
>>> the case of Wiki there are hardly any settings I believe I could have gotten
>>> wrong.
>>>
>>>
>>> On Mon, Oct 8, 2012 at 6:58 PM, Karl Wright wrote:
>>>>
>>>> I just tried this; the experiment yields no document deletions
>>>> recorded in the simple history (as expected).
>>>>
>>>> So clearly there is a complicating factor somewhere that you will need to
>>>> find.
>>>>
>>>> I would suggest going about the basic process of eliminating
>>>> variables. For example, try a continuous crawl in your environment
>>>> using the file system connector on a moderately-sized set of sample
>>>> documents, and see if it seems to do the same thing as the other
>>>> connectors you are using. If it does, then that would suggest that
>>>> one of your modifications was in fact causing the problem. If not,
>>>> then I should look at trying to repeat the experiment here with one of
>>>> the connectors you are working with.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Mon, Oct 8, 2012 at 12:22 PM, Karl Wright wrote:
>>>> > There is no logic whatsoever in agents-shutdown that should delete
>>>> > documents from the queue and from the index, and I have never seen
>>>> > this behavior before, but this is really easy to verify. It should be
>>>> > simple to take an unaltered 1.0 distribution, create a filesystem job
>>>> > on the multiprocess example, start it crawling continuously, then stop
>>>> > and restart the agents process, and then look at the simple history to
>>>> > see whether any documents get deleted or not. I may have time to try
>>>> > this later in the evening, we'll see.
>>>> >
>>>> > Karl
>>>> >
>>>> > On Mon, Oct 8, 2012 at 12:06 PM, Martin Gielow wrote:
>>>> >> Hi Karl,
>>>> >>
>>>> >> thanks for the lightning-speed reply! :)
>>>> >>
>>>> >> On Mon, Oct 8, 2012 at 5:23 PM, Karl Wright wrote:
>>>> >>>
>>>> >>> Hi Martin,
>>>> >>>
>>>> >>> The behavior you describe is expected only if you are either deleting
>>>> >>> the job, or the job is set to expire old documents after a certain
>>>> >>> time interval (and that interval has transpired).
>>>> >>>
>>>> >>> Can you tell me what your expiration interval is?
>>>> >>>
>>>> >>
>>>> >> The expiration interval is set to 1440 (minutes, according to the
>>>> >> interface). I also just tried to leave the box empty, so that there
>>>> >> should be no expiration, but the behaviour remained the same.
>>>> >>
>>>> >>>
>>>> >>> Also, when you say "shutting down agents process", can you clarify
>>>> >>> what deployment model you are using? How are you shutting down this
>>>> >>> process?
>>>> >>
>>>> >>
>>>> >> I am using a slightly modified version of the multiprocess-example with
>>>> >> postgres as the DBMS. To run and shutdown the agents I use the batch
>>>> >> files that are provided with the example (start-agents.bat and
>>>> >> stop-agents.bat).
>>>> >> I have also tried to run the agents process from Eclipse to be able to
>>>> >> debug into it and was getting the same results.
>>>> >>
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Karl
>>>> >>
>>>> >>
>>>> >> Regards,
>>>> >> Martin
>>>> >>
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Oct 8, 2012 at 11:18 AM, Martin Gielow wrote:
>>>> >>> > Hello,
>>>> >>> >
>>>> >>> > I'm using Manifold to crawl several data sources using the Wiki and
>>>> >>> > the JDBC connectors. I have set the associated jobs to run
>>>> >>> > continuously so that new documents will be added in a timely manner.
>>>> >>> > The problem I am having with this, is that whenever the Agent is
>>>> >>> > stopped and then restarted, the jobs will delete all of their
>>>> >>> > documents (also propagating the deletes to the associated output
>>>> >>> > connection) before turning themselves inactive (which they
>>>> >>> > shouldn't as they are set to run continuously).
>>>> >>> >
>>>> >>> > If I then restart the job, in case of the JDBC connection, it is not
>>>> >>> > finding any previously added documents and will set itself inactive
>>>> >>> > again. In case of the Wiki connection, the documents are also
>>>> >>> > deleted, but are successfully reindexed when the job is restarted
>>>> >>> > manually.
>>>> >>> >
>>>> >>> > The only way I found to prevent the jobs from deleting their items
>>>> >>> > in this case, was to manually stop the affected jobs before the
>>>> >>> > Agent is stopped (using the abort option) and to restart them after
>>>> >>> > the Agent has been restarted.
>>>> >>> >
>>>> >>> >
>>>> >>> > I am using the 1.0 release of Manifold and couldn't find anything
>>>> >>> > regarding this behaviour in either the documentation or the wiki.
>>>> >>> >
>>>> >>> > Is there an obvious flaw with my setup or something I may have
>>>> >>> > missed in the configuration?
>>>> >>> >
>>>> >>> > Thanks in advance for any tips!
>>>> >>> >
>>>> >>> > Regards,
>>>> >>> > Martin
>>>> >>
>>>> >>
>>>
>>>