manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Problem with continuous jobs deleting their documents on restart of Agent
Date Tue, 09 Oct 2012 20:28:21 GMT
I've checked a fix for this into trunk.  See the ticket for details.

Please try trunk in your environment to be sure everything looks good
for you now.

I think it may be a serious enough problem to force a 1.01 patch
release.  Comments?

Karl

On Tue, Oct 9, 2012 at 9:36 AM, Karl Wright <daddywri@gmail.com> wrote:
> CONNECTORS-551
>
> FWIW, I will not be able to look at this for another few hours, most likely.
>
> Karl
>
>
> On Tue, Oct 9, 2012 at 9:31 AM, Karl Wright <daddywri@gmail.com> wrote:
>> Hi Martin,
>>
>> FWIW, the agents startup sequence also does not have logic which
>> deletes documents or jobs.
>>
>> Nevertheless I will create a ticket and have a look at this ASAP.
>>
>> Karl
>>
>> On Tue, Oct 9, 2012 at 9:25 AM, Martin Gielow <martin.gielow@gmail.com> wrote:
>>> I have just completed testing the behaviour on the unaltered
>>> multiprocess-example using the provided HSQL instance.
>>>
>>> Indeed, when using the file system connector, Manifold works as it should.
>>> The agent can be stopped and restarted and the previously processed
>>> documents are retained. When I tried the JDBC (pointed to a MySQL DB) and
>>> Wiki connectors, however, I received the same results as yesterday - all
>>> documents are deleted as soon as the agent restarts (not on shutdown but
>>> when running the agent again after it has been stopped).
>>>
>>> For the JDBC connector I could imagine that this may somehow be related to
>>> flawed seeding or version queries (although I believe them to be ok), but in
>>> the case of Wiki there are hardly any settings I believe I could have gotten
>>> wrong.
>>>
>>>
>>> On Mon, Oct 8, 2012 at 6:58 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>> I just tried this; the experiment yields no document deletions
>>>> recorded in the simple history (as expected).
>>>>
>>>> So clearly there is a complicating factor somewhere that you will need to
>>>> find.
>>>>
>>>> I would suggest going about the basic process of eliminating
>>>> variables.  For example, try a continuous crawl in your environment
>>>> using the file system connector on a moderately-sized set of sample
>>>> documents, and see if it seems to do the same thing as the other
>>>> connectors you are using.  If it does, then that would suggest that
>>>> one of your modifications was in fact causing the problem.  If not,
>>>> then I should look at trying to repeat the experiment here with one of
>>>> the connectors you are working with.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Mon, Oct 8, 2012 at 12:22 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>> > There is no logic whatsoever in agents-shutdown that should delete
>>>> > documents from the queue and from the index, and I have never seen
>>>> > this behavior before, but this is really easy to verify.  It should be
>>>> > simple to take an unaltered 1.0 distribution, create a filesystem job
>>>> > on the multiprocess example, start it crawling continuously, then stop
>>>> > and restart the agents process, and then look at the simple history to
>>>> > see whether any documents get deleted or not.  I may have time to try
>>>> > this later in the evening, we'll see.
>>>> >
>>>> > Karl
>>>> >
>>>> > On Mon, Oct 8, 2012 at 12:06 PM, Martin Gielow <martin.gielow@gmail.com>
>>>> > wrote:
>>>> >> Hi Karl,
>>>> >>
>>>> >> thanks for the lightning-speed reply! :)
>>>> >>
>>>> >> On Mon, Oct 8, 2012 at 5:23 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi Martin,
>>>> >>>
>>>> >>> The behavior you describe is expected only if you are either deleting
>>>> >>> the job, or the job is set to expire old documents after a certain
>>>> >>> time interval (and that interval has transpired).
>>>> >>>
>>>> >>> Can you tell me what your expiration interval is?
>>>> >>>
>>>> >>
>>>> >> The expiration interval is set to 1440 (minutes, according to the
>>>> >> interface). I also just tried to leave the box empty, so that there
>>>> >> should be no expiration, but the behaviour remained the same.
>>>> >>
>>>> >>>
>>>> >>> Also, when you say "shutting down agents process", can you clarify
>>>> >>> what deployment model you are using?  How are you shutting down
>>>> >>> this process?
>>>> >>
>>>> >>
>>>> >> I am using a slightly modified version of the multiprocess-example
>>>> >> with postgres as the DBMS. To run and shut down the agents I use the
>>>> >> batch files that are provided with the example (start-agents.bat and
>>>> >> stop-agents.bat). I have also tried to run the agents process from
>>>> >> Eclipse to be able to debug into it and was getting the same results.
>>>> >>
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Karl
>>>> >>
>>>> >>
>>>> >> Regards,
>>>> >> Martin
>>>> >>
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Oct 8, 2012 at 11:18 AM, Martin Gielow <martin.gielow@gmail.com> wrote:
>>>> >>> > Hello,
>>>> >>> >
>>>> >>> > I'm using Manifold to crawl several data sources using the Wiki
>>>> >>> > and the JDBC connectors. I have set the associated jobs to run
>>>> >>> > continuously so that new documents will be added in a timely
>>>> >>> > manner. The problem I am having with this is that whenever the
>>>> >>> > Agent is stopped and then restarted, the jobs will delete all of
>>>> >>> > their documents (also propagating the deletes to the associated
>>>> >>> > output connection) before turning themselves inactive (which they
>>>> >>> > shouldn't as they are set to run continuously).
>>>> >>> >
>>>> >>> > If I then restart the job, in case of the JDBC connection, it is
>>>> >>> > not finding any previously added documents and will set itself
>>>> >>> > inactive again. In case of the Wiki connection, the documents are
>>>> >>> > also deleted, but are successfully reindexed when the job is
>>>> >>> > restarted manually.
>>>> >>> >
>>>> >>> > The only way I found to prevent the jobs from deleting their
>>>> >>> > items in this case was to manually stop the affected jobs before
>>>> >>> > the Agent is stopped (using the abort option) and to restart them
>>>> >>> > after the Agent has been restarted.
>>>> >>> >
>>>> >>> >
>>>> >>> > I am using the 1.0 release of Manifold and couldn't find anything
>>>> >>> > regarding this behaviour in either the documentation or the wiki.
>>>> >>> >
>>>> >>> > Is there an obvious flaw with my setup or something I may have
>>>> >>> > missed in the configuration?
>>>> >>> >
>>>> >>> > Thanks in advance for any tips!
>>>> >>> >
>>>> >>> > Regards,
>>>> >>> > Martin
>>>> >>
>>>> >>
>>>
>>>
