manifoldcf-user mailing list archives

From Martin Gielow <martin.gie...@gmail.com>
Subject Re: Problem with continuous jobs deleting their documents on restart of Agent
Date Tue, 09 Oct 2012 13:25:31 GMT
I have just completed testing the behaviour on the unaltered
multiprocess-example using the provided HSQL instance.

Indeed, when using the file system connector, Manifold works as it should.
The agent can be stopped and restarted and the previously processed
documents are retained. When I tried the JDBC (pointed to a MySQL DB) and
Wiki connectors, however, I received the same results as yesterday - all
documents are deleted as soon as the agent restarts (not on shutdown but
when running the agent again after it has been stopped).

For the JDBC connector I could imagine that this may somehow be related to
flawed seeding or version queries (although I believe them to be ok), but
in the case of Wiki there are hardly any settings I believe I could have
gotten wrong.


On Mon, Oct 8, 2012 at 6:58 PM, Karl Wright <daddywri@gmail.com> wrote:

> I just tried this; the experiment yields no document deletions
> recorded in the simple history (as expected).
>
> So clearly there is a complicating factor somewhere that you will need to
> find.
>
> I would suggest going about the basic process of eliminating
> variables.  For example, try a continuous crawl in your environment
> using the file system connector on a moderately-sized set of sample
> documents, and see if it seems to do the same thing as the other
> connectors you are using.  If it does, then that would suggest that
> one of your modifications was in fact causing the problem.  If not,
> then I should look at trying to repeat the experiment here with one of
> the connectors you are working with.
>
> Thanks,
> Karl
>
> On Mon, Oct 8, 2012 at 12:22 PM, Karl Wright <daddywri@gmail.com> wrote:
> > There is no logic whatsoever in agents-shutdown that should delete
> > documents from the queue and from the index, and I have never seen
> > this behavior before, but this is really easy to verify.  It should be
> > simple to take an unaltered 1.0 distribution, create a filesystem job
> > on the multiprocess example, start it crawling continuously, then stop
> > and restart the agents process, and then look at the simple history to
> > see whether any documents get deleted or not.  I may have time to try
> > this later in the evening; we'll see.
> >
> > Karl
> >
> > On Mon, Oct 8, 2012 at 12:06 PM, Martin Gielow <martin.gielow@gmail.com> wrote:
> >> Hi Karl,
> >>
> >> thanks for the lightning-speed reply! :)
> >>
> >> On Mon, Oct 8, 2012 at 5:23 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>>
> >>> Hi Martin,
> >>>
> >>> The behavior you describe is expected only if you are either deleting
> >>> the job, or the job is set to expire old documents after a certain
> >>> time interval (and that interval has transpired).
> >>>
> >>> Can you tell me what your expiration interval is?
> >>>
> >>
> >> The expiration interval is set to 1440 (minutes, according to the
> >> interface). I also just tried to leave the box empty, so that there should
> >> be no expiration, but the behaviour remained the same.
> >>
> >>>
> >>> Also, when you say "shutting down agents process", can you clarify
> >>> what deployment model you are using?  How are you shutting down this
> >>> process?
> >>
> >>
> >> I am using a slightly modified version of the multiprocess-example with
> >> postgres as the DBMS. To run and shut down the agents I use the batch files
> >> that are provided with the example (start-agents.bat and stop-agents.bat).
> >> I have also tried to run the agents process from Eclipse to be able to debug
> >> into it and was getting the same results.
> >>
> >>>
> >>> Thanks,
> >>> Karl
> >>
> >>
> >> Regards,
> >> Martin
> >>
> >>
> >>>
> >>>
> >>> On Mon, Oct 8, 2012 at 11:18 AM, Martin Gielow <martin.gielow@gmail.com> wrote:
> >>> > Hello,
> >>> >
> >>> > I'm using Manifold to crawl several data sources using the Wiki and the JDBC
> >>> > connectors. I have set the associated jobs to run continuously so that new
> >>> > documents will be added in a timely manner. The problem I am having with
> >>> > this is that whenever the Agent is stopped and then restarted, the jobs
> >>> > will delete all of their documents (also propagating the deletes to the
> >>> > associated output connection) before turning themselves inactive (which they
> >>> > shouldn't, as they are set to run continuously).
> >>> >
> >>> > If I then restart the job, in case of the JDBC connection, it is not finding
> >>> > any previously added documents and will set itself inactive again. In case
> >>> > of the Wiki connection, the documents are also deleted, but are successfully
> >>> > reindexed when the job is restarted manually.
> >>> >
> >>> > The only way I found to prevent the jobs from deleting their items in this
> >>> > case was to manually stop the affected jobs before the Agent is stopped
> >>> > (using the abort option) and to restart them after the Agent has been
> >>> > restarted.
> >>> >
> >>> >
> >>> > I am using the 1.0 release of Manifold and couldn't find anything regarding
> >>> > this behaviour in either the documentation or the wiki.
> >>> >
> >>> > Is there an obvious flaw with my setup or something I may have missed in the
> >>> > configuration?
> >>> >
> >>> > Thanks in advance for any tips!
> >>> >
> >>> > Regards,
> >>> > Martin
> >>
> >>
>
