manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Problem with continuous jobs deleting their documents on restart of Agent
Date Tue, 09 Oct 2012 13:36:03 GMT
CONNECTORS-551

FWIW, I will not be able to look at this for another few hours, most likely.

Karl


On Tue, Oct 9, 2012 at 9:31 AM, Karl Wright <daddywri@gmail.com> wrote:
> Hi Martin,
>
> FWIW, the agents startup sequence also does not have logic which
> deletes documents or jobs.
>
> Nevertheless I will create a ticket and have a look at this ASAP.
>
> Karl
>
> On Tue, Oct 9, 2012 at 9:25 AM, Martin Gielow <martin.gielow@gmail.com> wrote:
>> I have just completed testing the behaviour on the unaltered
>> multiprocess-example using the provided HSQL instance.
>>
>> Indeed, when using the file system connector, Manifold works as it should.
>> The agent can be stopped and restarted and the previously processed
>> documents are retained. When I tried the JDBC (pointed to a MySQL DB) and
>> Wiki connectors, however, I received the same results as yesterday - all
>> documents are deleted as soon as the agent restarts (not on shutdown but
>> when running the agent again after it has been stopped).
>>
>> For the JDBC connector I could imagine that this may somehow be related to
>> flawed seeding or version queries (although I believe them to be ok), but in
>> the case of Wiki there are hardly any settings I believe I could have gotten
>> wrong.
>>
>>
>> On Mon, Oct 8, 2012 at 6:58 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>> I just tried this; the experiment yields no document deletions
>>> recorded in the simple history (as expected).
>>>
>>> So clearly there is a complicating factor somewhere that you will need to
>>> find.
>>>
>>> I would suggest going about the basic process of eliminating
>>> variables.  For example, try a continuous crawl in your environment
>>> using the file system connector on a moderately-sized set of sample
>>> documents, and see if it seems to do the same thing as the other
>>> connectors you are using.  If it does, then that would suggest that
>>> one of your modifications was in fact causing the problem.  If not,
>>> then I should look at trying to repeat the experiment here with one of
>>> the connectors you are working with.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Mon, Oct 8, 2012 at 12:22 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> > There is no logic whatsoever in agents-shutdown that should delete
>>> > documents from the queue and from the index, and I have never seen
>>> > this behavior before, but this is really easy to verify.  It should be
>>> > simple to take an unaltered 1.0 distribution, create a filesystem job
>>> > on the multiprocess example, start it crawling continuously, then stop
>>> > and restart the agents process, and then look at the simple history to
>>> > see whether any documents get deleted or not.  I may have time to try
>>> > this later in the evening, we'll see.
>>> >
>>> > Karl
>>> >
>>> > On Mon, Oct 8, 2012 at 12:06 PM, Martin Gielow <martin.gielow@gmail.com> wrote:
>>> >> Hi Karl,
>>> >>
>>> >> thanks for the lightning-speed reply! :)
>>> >>
>>> >> On Mon, Oct 8, 2012 at 5:23 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> >>>
>>> >>> Hi Martin,
>>> >>>
>>> >>> The behavior you describe is expected only if you are either deleting
>>> >>> the job, or the job is set to expire old documents after a certain
>>> >>> time interval (and that interval has transpired).
>>> >>>
>>> >>> Can you tell me what your expiration interval is?
>>> >>>
>>> >>
>>> >> The expiration interval is set to 1440 (minutes, according to the
>>> >> interface). I also just tried to leave the box empty, so that there
>>> >> should be no expiration, but the behaviour remained the same.
>>> >>
>>> >>>
>>> >>> Also, when you say "shutting down agents process", can you clarify
>>> >>> what deployment model you are using?  How are you shutting down
>>> >>> this process?
>>> >>
>>> >>
>>> >> I am using a slightly modified version of the multiprocess-example with
>>> >> postgres as the DBMS. To run and shut down the agents I use the batch
>>> >> files that are provided with the example (start-agents.bat and
>>> >> stop-agents.bat). I have also tried to run the agents process from
>>> >> Eclipse to be able to debug into it and was getting the same results.
>>> >>
>>> >>>
>>> >>> Thanks,
>>> >>> Karl
>>> >>
>>> >>
>>> >> Regards,
>>> >> Martin
>>> >>
>>> >>
>>> >>>
>>> >>>
>>> >>> On Mon, Oct 8, 2012 at 11:18 AM, Martin Gielow <martin.gielow@gmail.com> wrote:
>>> >>> > Hello,
>>> >>> >
>>> >>> > I'm using Manifold to crawl several data sources using the Wiki and
>>> >>> > the JDBC connectors. I have set the associated jobs to run
>>> >>> > continuously so that new documents will be added in a timely manner.
>>> >>> > The problem I am having with this is that whenever the Agent is
>>> >>> > stopped and then restarted, the jobs will delete all of their
>>> >>> > documents (also propagating the deletes to the associated output
>>> >>> > connection) before turning themselves inactive (which they shouldn't
>>> >>> > as they are set to run continuously).
>>> >>> >
>>> >>> > If I then restart the job, in the case of the JDBC connection, it is
>>> >>> > not finding any previously added documents and will set itself
>>> >>> > inactive again. In the case of the Wiki connection, the documents are
>>> >>> > also deleted, but are successfully reindexed when the job is
>>> >>> > restarted manually.
>>> >>> >
>>> >>> > The only way I found to prevent the jobs from deleting their items in
>>> >>> > this case was to manually stop the affected jobs before the Agent is
>>> >>> > stopped (using the abort option) and to restart them after the Agent
>>> >>> > has been restarted.
>>> >>> >
>>> >>> >
>>> >>> > I am using the 1.0 release of Manifold and couldn't find anything
>>> >>> > regarding this behaviour in either the documentation or the wiki.
>>> >>> >
>>> >>> > Is there an obvious flaw with my setup or something I may have
>>> >>> > missed in the configuration?
>>> >>> >
>>> >>> > Thanks in advance for any tips!
>>> >>> >
>>> >>> > Regards,
>>> >>> > Martin
>>> >>
>>> >>
>>
>>
