From: "reshu.agarwal" <reshu.agarwal@orkash.com>
Date: Mon, 08 Dec 2014 10:10:12 +0530
To: user@uima.apache.org
Subject: Re: DUCC - unstable behaviour of DUCC

Yes, I am running both at the same time. I had originally tried only the
1.1.0 version, to check its performance, but because of the unstable
behaviour I had to run DUCC 1.0.0 and DUCC 1.1.0 side by side: DUCC 1.0.0
for running jobs and DUCC 1.1.0 for testing.

Do I need to increase the heartbeat interval to more than 60 seconds?
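In case it helps, these are the knobs I am looking at in
$DUCC_HOME/resources/ducc.properties. The property names below are taken
from my reading of the default file, so please correct me if I am tuning
the wrong ones:

    # How often (in ms) each Agent publishes its node-metrics "heartbeat".
    ducc.agent.node.metrics.publish.rate = 60000

    # How many consecutive missed publications are tolerated before a node
    # is marked down. If the agents are slow rather than dead, raising this
    # looks safer than stretching the publish rate itself.
    ducc.rm.node.stability = 5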
Reshu.

On 12/05/2014 05:57 PM, Lou DeGenaro wrote:
> You can fetch the latest code containing the bug fix from SVN and build
> your own snapshot. However, this bug is of minimal impact so there is no
> pressing need to do so.
>
> Are you trying to run 1.0 and 1.1 at the same time? This can be very
> tricky. You need to be sure there are no overlaps. I highly recommend
> that you pick one or the other.
>
> Lou.
>
> On Fri, Dec 5, 2014 at 6:31 AM, reshu.agarwal wrote:
>
>> Dear Lou,
>>
>> Thanks for confirming this.
>>
>> Is the bug-fix version available for use?
>>
>> What can be the reason for the delay in heartbeats? The machines were
>> not able to send heartbeats within 60 seconds, so the node goes down in
>> DUCC 1.1.0, but DUCC 1.0.0 is working fine on the same machines.
>>
>> My master node is physical and the client is virtual. Can this be a
>> reason for the delay in heartbeats, as well as for the increase in job
>> processing time?
>>
>> Thanks.
>>
>> Reshu.
>>
>> On 12/05/2014 04:45 PM, Lou DeGenaro wrote:
>>
>>> Each node has a DUCC Agent daemon that sends heartbeats.
>>>
>>> There was a bug discovered after the release of 1.1 whereby the share
>>> calculation is incorrect (the rounding-up problem that you observe).
>>> The impact of this bug should be minimal. The bug has been fixed.
>>>
>>> Lou.
>>>
>>> On Fri, Dec 5, 2014 at 12:41 AM, reshu.agarwal wrote:
>>>
>>>> Lou,
>>>>
>>>> How does a node send its heartbeats in DUCC? If you can tell me this,
>>>> I will be able to identify why my nodes go down.
>>>>
>>>> The other problem I am facing is:
>>>>
>>>> Memory(GB):total  : 31
>>>> Memory(GB):usable : 16
>>>> Shares:total      : 8
>>>> Shares:inuse      : 9
>>>>
>>>> The actual RAM available is about 30 GB, so the shares available
>>>> should be 15 (at 2 GB per share), but it is showing
>>>> Memory(GB):usable : 16 and Shares:total : 8.
>>>>
>>>> In DUCC 1.0.0 I don't face this problem.
>>>>
>>>> Please explain the reason.
>>>>
>>>> Reshu.
>>>>
>>>> On 12/04/2014 06:42 PM, Lou DeGenaro wrote:
>>>>
>>>>> Which of these are not understandable? If you hover over the column
>>>>> heading, a little more explanation is given (though not much).
>>>>>
>>>>> For example, if you hover over Heartbeat(last) you'll see "The
>>>>> elapsed time (in seconds) since the last heartbeat". This should
>>>>> usually be around 60 seconds. On the system I'm looking at live
>>>>> presently, I see a range from 9 to 66. If the number gets too large,
>>>>> the DUCC system will consider the node down. As best as I can tell,
>>>>> your numbers are 58 and 59, which is perfect.
>>>>>
>>>>> Lou.
>>>>>
>>>>> On Thu, Dec 4, 2014 at 7:41 AM, reshu.agarwal wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Please look at these stats:
>>>>>>
>>>>>> Status  Name   Memory(GB):usable  Memory(GB):total  Swap(GB):inuse  Swap(GB):free  Alien PIDs  Shares:total  Shares:inuse  Heartbeat(last)
>>>>>>         Total  58                 70                0               29             9           29            3
>>>>>> up      S144   36                 39                0               20             8           18            2             59
>>>>>> down    S143   22                 31                0               9              1           11            11            58
>>>>>>
>>>>>> I am not able to understand these stats.
>>>>>>
>>>>>> Please help.
>>>>>>
>>>>>> Reshu.
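P.S. To spell out the share arithmetic I expected for node S143 in the
stats quoted above (a sketch only, in Python; I am assuming the default
2 GB share quantum, ducc.rm.share.quantum, and that usable memory is the
raw RAM rounded down to whole shares):

    share_quantum_gb = 2    # our share quantum (ducc.rm.share.quantum), in GB
    machine_gb = 31         # Memory(GB):total reported for node S143

    expected_shares = machine_gb // share_quantum_gb      # 15 shares
    expected_usable = expected_shares * share_quantum_gb  # 30 GB usable

    # DUCC 1.1.0 instead reports Shares:total = 8 and Memory(GB):usable = 16,
    # roughly half the expected values -- presumably the rounding bug Lou
    # mentioned above.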