uima-user mailing list archives

From priyank sharma <priyank.sha...@orkash.com>
Subject Re: DUCC's job goes into infinite loop
Date Wed, 15 Nov 2017 04:00:38 GMT
By "server down" I mean that one machine out of our cluster of three was 
disconnected, and all of the services were deployed on the machine that 
was disconnected from the cluster.

Thanks and Regards
Priyank Sharma

On Tuesday 14 November 2017 04:08 PM, Lou DeGenaro wrote:
> What do you mean by "server down", precisely?  Since we have no logs to
> look at we can only go by your descriptions.  We're trying to help...
>
> Lou.
>
> On Mon, Nov 13, 2017 at 11:30 PM, priyank sharma <priyank.sharma@orkash.com>
> wrote:
>
>> When our job goes into the infinite loop, the UIMA analysis engine does
>> not start, and one of the three servers is down. That server hosts all
>> of the services used by the UIMA analysis engine.
>>
>> Could the server being down be causing this issue?
>>
>> Could memory be the problem?
>>
>> Thanks and Regards
>> Priyank Sharma
>>
>> On Monday 13 November 2017 07:38 PM, Eddie Epstein wrote:
>>
>>> Several different issues here. There is no "job completion cap"; rather,
>>> there is a limit on how long an individual work item is allowed to
>>> process before it is labeled a timeout. The default number of such errors
>>> and exceptions before a Job is stopped is 15. Please increase this cap if
>>> you expect a work item to run longer.
>>>
>>> If a job process runs out of heap space it should go OOM at which point
>>> unpredictable things will happen.  Do you see OOM exceptions in the JP
>>> logfiles?
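To check for the OOM case Eddie mentions, a quick scan of the JP logfiles for OutOfMemoryError is usually enough. A minimal sketch, assuming the job's logs live under one directory; the path here is hypothetical, so substitute the log directory shown on the ducc-mon job details page:

```shell
#!/bin/sh
# Scan every job-process logfile for JVM heap exhaustion.
# LOG_DIR is an assumption -- point it at your job's real log directory.
LOG_DIR="${LOG_DIR:-$HOME/ducc/logs/1234}"
if grep -rn "java.lang.OutOfMemoryError" "$LOG_DIR"; then
    echo "OOM found -- consider raising the job process memory"
else
    echo "no OOM in $LOG_DIR"
fi
```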
>>>
>>> As for a bug, it is still hard to understand what is happening. Newer
>>> versions of DUCC include a ducc_gather_logs command that collects DUCC
>>> daemon logfiles and state, making it more likely we can understand what
>>> is happening. No user application logfiles are included in the captured
>>> tar file.
>>>
>>> Regards,
>>> Eddie
>>>
>>> On Mon, Nov 13, 2017 at 12:33 AM, priyank sharma <
>>> priyank.sharma@orkash.com>
>>> wrote:
>>>
>>>> Yes, I am using DUCC v2.0.1. I have a three-node cluster with 32 GB,
>>>> 40 GB, and 28 GB of RAM. The job runs fine for 15-20 days; after that it
>>>> goes into the infinite loop with the same batch of IDs. We have a
>>>> 75-minute cap for a job to complete; if it does not finish, it is
>>>> started again, so every 75 minutes a new job starts, but with the same
>>>> ID batch as before, and not a single document is ingested into the data
>>>> store. It stays in this state until we restart the server.
>>>>
>>>> Is this because of DUCC v2.0.1? Does this version of DUCC have that
>>>> bug?
>>>>
>>>> Could this problem be caused by Java heap space?
>>>>
>>>> Please suggest something, as there is nothing in the logs related to my
>>>> problem.
>>>>
>>>> Thanks and Regards
>>>> Priyank Sharma
>>>>
>>>> On Friday 10 November 2017 09:00 PM, Eddie Epstein wrote:
>>>>
>>>>> Hi Priyank,
>>>>>
>>>>> Looks like you are running DUCC v2.0.x. There are so many bugs fixed in
>>>>> subsequent versions, the latest being v2.2.1. Newer versions have a
>>>>> ducc_update command that will upgrade an existing install, but given all
>>>>> the changes since v2.0.x I suggest a clean install.
>>>>>
>>>>> Eddie
>>>>>
>>>>> On Fri, Nov 10, 2017 at 12:11 AM, priyank sharma <
>>>>> priyank.sharma@orkash.com>
>>>>> wrote:
>>>>>
>>>>>> There is nothing on the work item page or the performance page on the
>>>>>> web server. There is only one log file, for the main node; there are
>>>>>> no log files for the other two nodes. The DUCC job processes are not
>>>>>> able to pick up the data from the data source, and no UIMA aggregator
>>>>>> is working for those batches.
>>>>>>
>>>>>> Could the issue be caused by Java heap space? We are giving 4 GB of
>>>>>> RAM to the job process.
>>>>>>
>>>>>> Attaching the log file.
>>>>>>
>>>>>> Thanks and Regards
>>>>>> Priyank Sharma
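A side note on the 4 GB figure: in DUCC the memory reserved for a job process and its JVM heap are configured separately, so both are worth confirming. A hypothetical specification fragment; the property names (process_memory_size, process_jvm_args) should be checked against your DUCC version's documentation:

```properties
# Memory reserved for each job process by the scheduler, in GB (assumed units).
process_memory_size = 4
# JVM heap for the job process -- kept below the reservation to leave headroom.
process_jvm_args = -Xmx3g
```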
>>>>>>
>>>>>> On Thursday 09 November 2017 04:33 PM, Lou DeGenaro wrote:
>>>>>>
>>>>>>> The first place to look is in your job's logs. Visit the ducc-mon
>>>>>>> jobs page ducchost:42133/jobs.jsp, then click on the id of your job.
>>>>>>> Examine the logs by clicking on each log file name, looking for any
>>>>>>> revealing information.
>>>>>>>
>>>>>>> Feel free to post non-confidential snippets here, or if you'd like to
>>>>>>> chat in real time we can use hipchat.
>>>>>>>
>>>>>>> Lou.
>>>>>>>
>>>>>>> On Thu, Nov 9, 2017 at 5:19 AM, priyank sharma <
>>>>>>> priyank.sharma@orkash.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> All!
>>>>>>>>
>>>>>>>> I have a problem with my DUCC cluster in which a job process gets
>>>>>>>> stuck and keeps processing the same batch again and again. When the
>>>>>>>> maximum duration is reached, the batch gets the reason or
>>>>>>>> extraordinary status "CanceledByUser" and is then restarted with the
>>>>>>>> same IDs. This usually happens after 15 to 20 days and goes away
>>>>>>>> after restarting the DUCC cluster. While going through the data
>>>>>>>> store that the CAS consumer uses to ingest data, the data for this
>>>>>>>> batch never gets ingested, so most probably this data is not being
>>>>>>>> processed.
>>>>>>>>
>>>>>>>> How can I check whether this data is being processed or not?
>>>>>>>>
>>>>>>>> Are resources the issue, and why is the data processed after
>>>>>>>> restarting the cluster?
>>>>>>>>
>>>>>>>> We have a three-node cluster with 32 GB, 40 GB, and 28 GB of RAM.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks and Regards
>>>>>>>> Priyank Sharma

