manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Problem with continuous jobs deleting their documents on restart of Agent
Date Tue, 09 Oct 2012 13:31:48 GMT
Hi Martin,

FWIW, the agents startup sequence also does not have logic which
deletes documents or jobs.

Nevertheless I will create a ticket and have a look at this ASAP.

Karl

On Tue, Oct 9, 2012 at 9:25 AM, Martin Gielow <martin.gielow@gmail.com> wrote:
> I have just completed testing the behaviour on the unaltered
> multiprocess-example using the provided HSQL instance.
>
> Indeed, when using the file system connector, Manifold works as it should.
> The agent can be stopped and restarted and the previously processed
> documents are retained. When I tried the JDBC (pointed to a MySQL DB) and
> Wiki connectors, however, I received the same results as yesterday - all
> documents are deleted as soon as the agent restarts (not on shutdown but
> when running the agent again after it has been stopped).
>
> For the JDBC connector I could imagine that this may somehow be related to
> flawed seeding or version queries (although I believe them to be ok), but in
> the case of Wiki there are hardly any settings I believe I could have gotten
> wrong.
>
>
> On Mon, Oct 8, 2012 at 6:58 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> I just tried this; the experiment yields no document deletions
>> recorded in the simple history (as expected).
>>
>> So clearly there is a complicating factor somewhere that you will need to
>> find.
>>
>> I would suggest going about the basic process of eliminating
>> variables.  For example, try a continuous crawl in your environment
>> using the file system connector on a moderately-sized set of sample
>> documents, and see if it seems to do the same thing as the other
>> connectors you are using.  If it does, then that would suggest that
>> one of your modifications was in fact causing the problem.  If not,
>> then I should look at trying to repeat the experiment here with one of
>> the connectors you are working with.
>>
>> Thanks,
>> Karl
>>
>> On Mon, Oct 8, 2012 at 12:22 PM, Karl Wright <daddywri@gmail.com> wrote:
>> > There is no logic whatsoever in agents-shutdown that should delete
>> > documents from the queue and from the index, and I have never seen
>> > this behavior before, but this is really easy to verify.  It should be
>> > simple to take an unaltered 1.0 distribution, create a filesystem job
>> > on the multiprocess example, start it crawling continuously, then stop
>> > and restart the agents process, and then look at the simple history to
>> > see whether any documents get deleted or not.  I may have time to try
>> > this later in the evening, we'll see.
>> >
>> > Karl
>> >
>> > On Mon, Oct 8, 2012 at 12:06 PM, Martin Gielow <martin.gielow@gmail.com> wrote:
>> >> Hi Karl,
>> >>
>> >> thanks for the lightning-speed reply! :)
>> >>
>> >> On Mon, Oct 8, 2012 at 5:23 PM, Karl Wright <daddywri@gmail.com> wrote:
>> >>>
>> >>> Hi Martin,
>> >>>
>> >>> The behavior you describe is expected only if you are either deleting
>> >>> the job, or the job is set to expire old documents after a certain
>> >>> time interval (and that interval has transpired).
>> >>>
>> >>> Can you tell me what your expiration interval is?
>> >>>
>> >>
>> >> The expiration interval is set to 1440 (minutes, according to the
>> >> interface). I also just tried to leave the box empty, so that there
>> >> should be no expiration, but the behaviour remained the same.
>> >>
>> >>>
>> >>> Also, when you say "shutting down agents process", can you clarify
>> >>> what deployment model you are using?  How are you shutting down this
>> >>> process?
>> >>
>> >>
>> >> I am using a slightly modified version of the multiprocess-example with
>> >> postgres as the DBMS. To run and shut down the agents I use the batch
>> >> files that are provided with the example (start-agents.bat and
>> >> stop-agents.bat). I have also tried to run the agents process from
>> >> Eclipse to be able to debug into it and was getting the same results.
>> >>
>> >>>
>> >>> Thanks,
>> >>> Karl
>> >>
>> >>
>> >> Regards,
>> >> Martin
>> >>
>> >>
>> >>>
>> >>>
>> >>> On Mon, Oct 8, 2012 at 11:18 AM, Martin Gielow <martin.gielow@gmail.com> wrote:
>> >>> > Hello,
>> >>> >
>> >>> > I'm using Manifold to crawl several data sources using the Wiki and
>> >>> > the JDBC connectors. I have set the associated jobs to run
>> >>> > continuously so that new documents will be added in a timely manner.
>> >>> > The problem I am having with this is that whenever the Agent is
>> >>> > stopped and then restarted, the jobs will delete all of their
>> >>> > documents (also propagating the deletes to the associated output
>> >>> > connection) before turning themselves inactive (which they shouldn't
>> >>> > as they are set to run continuously).
>> >>> >
>> >>> > If I then restart the job, in case of the JDBC connection, it is not
>> >>> > finding any previously added documents and will set itself inactive
>> >>> > again. In case of the Wiki connection, the documents are also
>> >>> > deleted, but are successfully reindexed when the job is restarted
>> >>> > manually.
>> >>> >
>> >>> > The only way I found to prevent the jobs from deleting their items
>> >>> > in this case was to manually stop the affected jobs before the
>> >>> > Agent is stopped (using the abort option) and to restart them after
>> >>> > the Agent has been restarted.
>> >>> >
>> >>> >
>> >>> > I am using the 1.0 release of Manifold and couldn't find anything
>> >>> > regarding this behaviour in either the documentation or the wiki.
>> >>> >
>> >>> > Is there an obvious flaw with my setup or something I may have
>> >>> > missed in the configuration?
>> >>> >
>> >>> > Thanks in advance for any tips!
>> >>> >
>> >>> > Regards,
>> >>> > Martin
>> >>
>> >>
>
>
