Subject: Re: Stopped working on IGNITE-752 (speed up failure detection)
To: dev@ignite.incubator.apache.org
References: <55B63805.6010303@gridgain.com>
From: Denis Magda
Message-ID: <55B63B1F.8000403@gridgain.com>
Date: Mon, 27 Jul 2015 17:07:27 +0300
MIME-Version: 1.0
In-Reply-To: <55B63805.6010303@gridgain.com>
Content-Type: multipart/alternative; boundary="------------050408030602020402000809"

--------------050408030602020402000809
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

Sorry, I forgot that attachments are not allowed.
Attachment to public URL mapping:

1) ignite-results-failure-detection.zip -> https://goo.gl/5mitfS
2) ignite-results-no-failure-detection-explicit-timeouts.zip -> https://goo.gl/as4qph
3) ignite-results-1.3.0.zip -> https://goo.gl/m8lbiR

--
Denis

On 7/27/2015 4:54 PM, Denis Magda wrote:
> Dmitriy, Igniters,
>
> I've got the first yardstick benchmarking results on Amazon EC2.
> Thanks to Nikolay for the guidance and the ready-to-use yardstick
> docker image.
>
> The configuration was as follows: c4.xlarge, 5 server nodes, 1
> backup, running the put/get benchmark, manually stopping one instance
> during the execution.
> Warmup time 60 seconds, execution time 150 seconds, 64 threads.
>
> 1) Failure detection timeout set to *300 ms*.
> Unfortunately, the drop when one of the server nodes is killed is
> significant. Please see the resulting plot in
> ignite-results-failure-detection.zip.
>
> Making the timeout lower doesn't improve the situation.
>
> Right after that I decided to run the same benchmark with the failure
> detection timeout ignored, by setting several network-related timeouts
> explicitly (these timeouts were used before, when we got an
> insignificant drop).
> TcpCommunicationSpi.setSocketWriteTimeout(200)
> TcpDiscoverySpi.setAckTimeout(50)
> TcpDiscoverySpi.setNetworkTimeout(5000)
> TcpDiscoverySpi.setHeartbeatFrequency(100)
>
> 2) Explicitly set the timeouts above, run against the latest changes
> including mine.
> Here I saw pretty much the same result: the drop is again significant.
> Have a look at the plot in
> ignite-results-no-failure-detection-explicit-timeouts.zip.
>
> 3) The final sanity check was done on the latest release,
> ignite-1.3.0-incubation, which does NOT contain my changes. The timeouts
> were the same as in 2).
> Unfortunately, here I see the same drop as well. Look into
> ignite-results-1.3.0.zip.
>
> It seems that we had that drop even before my 'failure detection timeout'
> changes were merged, judging by 3).
> I will try to debug all of this more thoroughly tomorrow.
>
> --
> Denis
>
> On 7/24/2015 7:15 PM, Dmitriy Setrakyan wrote:
>> Thanks Denis!
>>
>> This feature significantly simplifies failure detection configuration in
>> Ignite - just one configuration flag now vs. I don't even remember how many.
>>
>> Have you run a yardstick test on Amazon EC2 with this new configuration
>> flag? If we kill a node in the middle, then the drop should be insignificant.
>>
>> Also, I want to note your excellent handling of Jira communication. The
>> ticket has been thoroughly updated every step of the way.
>>
>> D.
>>
>> On Fri, Jul 24, 2015 at 5:37 AM, Denis Magda wrote:
>>
>>> Igniters,
>>>
>>> I have just merged the changes back into the main development branch.
>>> Thanks to Yakov and Dmitriy for spending your time on the review!
>>>
>>> From now on it’s possible to detect failures at the cluster nodes'
>>> discovery/communication/network levels by altering a single parameter -
>>> IgniteConfiguration.failureDetectionTimeout.
>>>
>>> Setting the failure detection timeout for a server node makes it
>>> possible to detect failed nodes in a cluster topology within a time equal
>>> to the timeout's value and to switch to/keep working with only alive nodes.
>>> Setting the timeout for a client node lets us detect failures
>>> between the client and its router node (a server node that is a part of a
>>> topology).
>>>
>>> In addition, a bunch of other improvements and simplifications were done at
>>> the level of TcpDiscoverySpi and TcpCommunicationSpi. Changes are
>>> aggregated here:
>>> https://issues.apache.org/jira/browse/IGNITE-752
>>>
>>> --
>>> Denis
>
--------------050408030602020402000809--