uima-user mailing list archives

From "reshu.agarwal" <reshu.agar...@orkash.com>
Subject Re: DUCC-unstable behaviour of ducc
Date Mon, 08 Dec 2014 04:40:12 GMT

Yes, I am running both at the same time. At first I ran only version 1.1.0
to check its performance, but because of its unstable behaviour I had to run
DUCC 1.0.0 and DUCC 1.1.0 side by side. I am using DUCC 1.0.0 for running
jobs and DUCC 1.1.0 for testing.

Do I need to increase the heartbeat timing to more than 60 seconds?

Reshu.

On 12/05/2014 05:57 PM, Lou DeGenaro wrote:
> You can fetch the latest code containing the bug fix from SVN and build
> your own snapshot.  However, this bug is of minimal impact so there is no
> pressing need to do so.
>
> Are you trying to run 1.0 and 1.1 at the same time?  This can be very
> tricky.  You need to be sure of no overlaps.  I highly recommend that you
> pick one or the other.
>
> Lou.
>
> On Fri, Dec 5, 2014 at 6:31 AM, reshu.agarwal <reshu.agarwal@orkash.com>
> wrote:
>
>> Dear Lou,
>>
>> Thanks for confirming this.
>>
>> Is a version containing the bug fix available for use?
>>
>> What could be causing the delay in heartbeats? The machines were not able
>> to send heartbeats within 60 seconds, so the node goes down in DUCC 1.1.0,
>> but DUCC 1.0.0 works fine on the same machines.
>>
>> My master node is physical and the client is virtual. Could this be a reason
>> for the delayed heartbeats as well as the increased job processing time?
>>
>> Thanks.
>>
>> Reshu.
>>
>>
>> On 12/05/2014 04:45 PM, Lou DeGenaro wrote:
>>
>>> Each node has a DUCC Agent daemon that sends heartbeats.
>>>
>>> There was a bug discovered after the release of 1.1 whereby the share
>>> calculation is incorrect (a rounding up problem that you observe).  The
>>> impact of this bug should be minimal.  The bug has been fixed.
>>>
>>> Lou.
>>>
>>>
>>>
>>> On Fri, Dec 5, 2014 at 12:41 AM, reshu.agarwal <reshu.agarwal@orkash.com>
>>> wrote:
>>>
>>>> Lou,
>>>>
>>>> How does a node send heartbeats in DUCC? If you can tell me this, I will be
>>>> able to identify why my nodes go down.
>>>>
>>>> The other problem I am facing is:
>>>>
>>>> Memory(GB):total  : 31
>>>> Memory(GB):usable : 16
>>>> Shares:total      :  8
>>>> Shares:inuse      :  9
>>>>
>>>> The RAM actually available is 30 GB, so the shares available should be 15
>>>> (2 GB per share), but it shows Memory(GB):usable : 16 and Shares:total : 8.
>>>>
>>>> In DUCC 1.0.0, I don't face this problem.
>>>>
>>>> Please explain the reason for this.
>>>>
>>>> Reshu.
>>>>
>>>>
>>>>
>>>> On 12/04/2014 06:42 PM, Lou DeGenaro wrote:
>>>>
>>>>> Which of these are not understandable?  If you hover over the column
>>>>> heading a little more explanation is given (though not much).
>>>>>
>>>>> For example, if you hover over Heartbeat(last) you'll see "The elapsed time
>>>>> (in seconds) since the last heartbeat".  This should usually be around 60
>>>>> seconds.  On the system I'm looking at live presently, I see a range from 9
>>>>> to 66.  If the number gets too large, the DUCC system will consider the
>>>>> node down.  As best as I can tell, it looks like your numbers are 58 & 59,
>>>>> which is perfect.
>>>>>
>>>>> Lou.
>>>>>
>>>>> On Thu, Dec 4, 2014 at 7:41 AM, reshu.agarwal <reshu.agarwal@orkash.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Please look at these stats:
>>>>>>
>>>>>> Status  Name  Memory(GB):usable  Memory(GB):total  Swap(GB):inuse  Swap(GB):free  Alien PIDs  Shares:total  Shares:inuse  Heartbeat(last)
>>>>>> Total         58                 70                0               29             9           29            3
>>>>>> up      S144  36                 39                0               20             8           18            2             59
>>>>>> down    S143  22                 31                0               9              1           11            11            58
>>>>>>
>>>>>> I am not able to understand these stats.
>>>>>>
>>>>>> Please help.
>>>>>>
>>>>>> Reshu.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
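The share arithmetic discussed in the thread can be sketched as follows. This is only an illustration of the numbers quoted above (2 GB per share, 16 GB usable, 8 shares total), not DUCC's actual code: the function names are hypothetical, and the assumption that shares are derived by dividing memory by the share size is taken from the thread, not from the DUCC source. The second function shows how rounding up instead of down can yield one share more than the memory really holds, which is the kind of "rounding up problem" Lou mentions.

```python
import math

# Share size quoted in the thread (2 GB per share).
SHARE_SIZE_GB = 2


def shares_rounded_down(memory_gb, share_size_gb=SHARE_SIZE_GB):
    """Number of whole shares that fit into memory_gb (hypothetical helper)."""
    return memory_gb // share_size_gb


def shares_rounded_up(memory_gb, share_size_gb=SHARE_SIZE_GB):
    """Same division, but rounded up; may overstate shares by one."""
    return math.ceil(memory_gb / share_size_gb)


# Values from the thread: 16 GB usable, 30 GB "actually available", 31 GB total.
for memory_gb in (16, 30, 31):
    print(memory_gb, shares_rounded_down(memory_gb), shares_rounded_up(memory_gb))
```

With these assumptions, 16 GB usable gives 8 shares, matching the Shares:total value reported, while an odd amount such as 31 GB gives 15 shares rounded down and 16 rounded up.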

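The heartbeat rule described in the thread (a node is considered down once the time since its last heartbeat grows too large) can be sketched in the same way. The thread does not say which threshold DUCC uses, so `down_after_seconds` and the value 120 below are purely illustrative assumptions, and `node_state` is a hypothetical name rather than a DUCC function.

```python
def node_state(seconds_since_heartbeat, down_after_seconds):
    """Classify a node from the age of its last heartbeat.

    down_after_seconds is a hypothetical threshold; the thread does not
    state the value DUCC actually uses.
    """
    if seconds_since_heartbeat > down_after_seconds:
        return "down"
    return "up"


# Heartbeat(last) values mentioned in the thread, checked against an
# assumed threshold of 120 seconds.
for age in (9, 58, 59, 66):
    print(age, node_state(age, down_after_seconds=120))
```

Under this sketch, heartbeats that arrive roughly every 60 seconds stay well inside the assumed threshold; a node would only be marked down if several consecutive heartbeats were delayed or lost.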
