uima-user mailing list archives

From Lou DeGenaro <lou.degen...@gmail.com>
Subject Re: DUCC-unstable behaviour of ducc
Date Wed, 10 Dec 2014 13:11:09 GMT
Are the machines where your DUCC daemons and/or agents run extremely busy?
Otherwise, I should think that the default heartbeat config should work as
is.

Lou.
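
For readers following the heartbeat discussion quoted below: the thread says the last-heartbeat age should usually be around 60 seconds, and that a node is considered down once that age gets too large. A minimal sketch of such a check follows. It is not DUCC source code; the function name, the `tolerance` parameter, and its default value are assumptions made purely for illustration.

```python
# Illustrative only: not DUCC source code. The function name and both
# default values are assumptions chosen for this example.

def node_state(seconds_since_heartbeat, expected_interval=60, tolerance=3):
    """Classify a node as 'up' or 'down' from its last-heartbeat age.

    A node is treated as down once its heartbeat age exceeds
    expected_interval * tolerance seconds.
    """
    limit = expected_interval * tolerance
    return "down" if seconds_since_heartbeat > limit else "up"


if __name__ == "__main__":
    # Ages taken from the thread (9, 58, 59, 66) plus one clearly stale value.
    for age in (9, 58, 59, 66, 400):
        print(age, node_state(age))
```

Raising the limit makes the check more tolerant of late heartbeats from busy or virtualized machines, at the cost of noticing a genuinely failed node later.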

On Wed, Dec 10, 2014 at 4:06 AM, reshu.agarwal <reshu.agarwal@orkash.com>
wrote:

> Dear Lou,
>
> My problem has been resolved. I increased the maximum time allowed for
> receiving heartbeats from the agents.
>
> The "unstable behavior" of DUCC 1.1.0 in my case was nodes going up and
> down. It happened both with a single instance of DUCC 1.1.0 and when
> running both DUCC versions simultaneously.
>
> I am now able to run DUCC 1.1.0 alone, so only DUCC 1.1.0 is configured.
>
> Thanks for your help. :-)
>
> Reshu.
>
>
>
>
> On 12/08/2014 04:24 PM, Lou DeGenaro wrote:
>
>> What is the "unstable behavior" of DUCC 1.1.0 when running it alone?
>>
>> All kinds of bad things can happen if you run 2 DUCCs on the same set of
>> machines. I'm willing to help, but am not sure I can if you are running 2
>> DUCCs - that's fairly complex.  Instead I urge you to run a single DUCC
>> 1.1.0 and let's try to fix what's wrong with running it alone.
>>
>> Lou.
>>
>> On Sun, Dec 7, 2014 at 11:40 PM, reshu.agarwal <reshu.agarwal@orkash.com>
>> wrote:
>>
>>> Yes, I am running both at the same time. I first tried running only
>>> version 1.1.0 to check its performance, but because of its unstable
>>> behaviour I had to run DUCC 1.0.0 and DUCC 1.1.0 at the same time. I am
>>> running DUCC 1.0.0 for running jobs and DUCC 1.1.0 for testing.
>>>
>>> Do I need to increase the heartbeat timing to more than 60 seconds?
>>>
>>> Reshu.
>>>
>>>
>>> On 12/05/2014 05:57 PM, Lou DeGenaro wrote:
>>>
>>>> You can fetch the latest code containing the bug fix from SVN and build
>>>> your own snapshot.  However, this bug is of minimal impact so there is no
>>>> pressing need to do so.
>>>>
>>>> Are you trying to run 1.0 and 1.1 at the same time?  This can be very
>>>> tricky.  You need to be sure of no overlaps.  I highly recommend that you
>>>> pick one or the other.
>>>>
>>>> Lou.
>>>>
>>>> On Fri, Dec 5, 2014 at 6:31 AM, reshu.agarwal <reshu.agarwal@orkash.com>
>>>> wrote:
>>>>
>>>>> Dear Lou,
>>>>>
>>>>> Thanks for confirming this.
>>>>>
>>>>> Is a version with the bug fix available for use?
>>>>>
>>>>> What could be causing the delay in heartbeats? The machines were not
>>>>> able to send heartbeats within 60 seconds, so the nodes go down in DUCC
>>>>> 1.1.0, but DUCC 1.0.0 works fine on the same machines.
>>>>>
>>>>> My master node is a physical machine and the client is a virtual one.
>>>>> Could this be a reason for the delayed heartbeats as well as for the
>>>>> increased job processing time?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Reshu.
>>>>>
>>>>>
>>>>> On 12/05/2014 04:45 PM, Lou DeGenaro wrote:
>>>>>
>>>>>> Each node has a DUCC Agent daemon that sends heartbeats.
>>>>>>
>>>>>> There was a bug discovered after the release of 1.1 whereby the share
>>>>>> calculation is incorrect (a rounding up problem that you observe).  The
>>>>>> impact of this bug should be minimal.  The bug has been fixed.
>>>>>>
>>>>>> Lou.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 5, 2014 at 12:41 AM, reshu.agarwal <
>>>>>> reshu.agarwal@orkash.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Lou,
>>>>>>>
>>>>>>> How does a node send heartbeats in DUCC? If you can tell me this, I will
>>>>>>> be able to identify why my nodes are going down.
>>>>>>>
>>>>>>> The other problem which I am facing is:
>>>>>>>
>>>>>>> Memory(GB):total  : 31
>>>>>>> Memory(GB):usable : 16
>>>>>>> Shares:total      :  8
>>>>>>> Shares:inuse      :  9
>>>>>>>
>>>>>>> The RAM actually available is 30 GB, so the shares available should be
>>>>>>> 15 (2 GB per share), but it is showing Memory(GB):usable : 16 and
>>>>>>> Shares:total : 8.
>>>>>>>
>>>>>>> In DUCC 1.0.0, I don't face this problem.
>>>>>>>
>>>>>>> Please explain the reason for this.
>>>>>>>
>>>>>>> Reshu.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 12/04/2014 06:42 PM, Lou DeGenaro wrote:
>>>>>>>
>>>>>>>> Which of these are not understandable?  If you hover over the column
>>>>>>>> heading a little more explanation is given (though not much).
>>>>>>>>
>>>>>>>> For example, if you hover over Heartbeat(last) you'll see "The elapsed
>>>>>>>> time (in seconds) since the last heartbeat".  This should usually be
>>>>>>>> around 60 seconds.  On the system I'm looking at live presently, I see a
>>>>>>>> range from 9 to 66.  If the number gets too large, the DUCC system will
>>>>>>>> consider the node down.  As best as I can tell, it looks like your
>>>>>>>> numbers are 58 & 59 which is perfect.
>>>>>>>>
>>>>>>>> Lou.
>>>>>>>>
>>>>>>>> On Thu, Dec 4, 2014 at 7:41 AM, reshu.agarwal <
>>>>>>>> reshu.agarwal@orkash.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Please look at these stats:
>>>>>>>>>
>>>>>>>>> Status  Name  Memory(GB):usable  Memory(GB):total  Swap(GB):inuse  Swap(GB):free  Alien PIDs  Shares:total  Shares:inuse  Heartbeat(last)
>>>>>>>>> Total                        58                70               0             29           9            29             3
>>>>>>>>> up      S144                 36                39               0             20           8            18             2               59
>>>>>>>>> down    S143                 22                31               0              9           1            11            11               58
>>>>>>>>>
>>>>>>>>> I am not able to understand these stats.
>>>>>>>>>
>>>>>>>>> Please help.
>>>>>>>>>
>>>>>>>>> Reshu.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>
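
The share arithmetic raised in the thread above (2 GB per share) can be illustrated with a short sketch. It is not DUCC's actual calculation, which the thread does not show; it only demonstrates how the rounding direction changes the share count for a given amount of memory. The function names are assumptions made for this example.

```python
import math

# Illustrative only: not DUCC source code. share_size_gb=2 comes from the
# thread; the function names are assumptions made for this example.

def shares_rounded_down(memory_gb, share_size_gb=2):
    """Whole shares that fit entirely within the given memory."""
    return memory_gb // share_size_gb


def shares_rounded_up(memory_gb, share_size_gb=2):
    """Share count when a partial share is rounded up to a full one."""
    return math.ceil(memory_gb / share_size_gb)


if __name__ == "__main__":
    # 16 and 31 GB are the usable/total values quoted in the thread;
    # 30 GB is the amount the poster expected to be available.
    for memory_gb in (16, 30, 31):
        print(memory_gb, shares_rounded_down(memory_gb), shares_rounded_up(memory_gb))
```

With 2 GB shares, 16 GB of usable memory gives 8 shares under either rounding, which matches the reported Shares:total of 8, while 31 GB gives 15 or 16 shares depending on the rounding direction.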
