From: Flavio Junqueira
To: "user@zookeeper.apache.org"
Subject: Re: sync vs. async vs. multi performances
Date: Sun, 19 Feb 2012 12:04:47 +0100
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: multipart/alternative; boundary=Apple-Mail-6--845030409

--Apple-Mail-6--845030409
Content-Type: text/plain; charset=us-ascii

Hi Ariel,

Here is what they mean:

	Net means the overhead of the replication protocol only, not writing to disk
	Net+disk means the overhead of the replication protocol with writes to disk enabled
	Net+disk (no write cache) same as the previous one, and we have turned the write cache of the disk off

-Flavio

On Feb 18, 2012, at 4:17 PM, Ariel Weisberg wrote:

> Hi,
>
> In that diagram, what is the difference between net, net + disk, and net +
> disk (no write cache)?
>
> Thanks,
> Ariel
>
> On Fri, Feb 17, 2012 at 3:41 AM, Flavio Junqueira wrote:
>
>> Hi Ariel, That wiki is stale. Check it here:
>>
>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeperPresentations
>>
>> In particular check the HIC talk, slide 57. We were using 1k byte writes
>> for those tests.
>>
>> -Flavio
>>
>> On Feb 15, 2012, at 12:18 AM, Ariel Weisberg wrote:
>>
>>> Hi,
>>>
>>> I tried to look at the presentations on the wiki, but the links aren't
>>> working? I was using
>>> http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations and the
>>> error at the top of the page is "You are not allowed to do AttachFile on
>>> this page. Login and try again."
>>>
>>> I used (http://pastebin.com/uu7igM3J) and the results for 4k writes were
>>> http://pastebin.com/N26CJtQE. 8.5 milliseconds, which is a bit slower than
>>> 5. Is it possible to beat the rotation speed?
>>>
>>> You can increase the write size quite a bit to 240k and it only goes up to
>>> 10 milliseconds. http://pastebin.com/MSTwaHYN
>>>
>>> My recollection was being in the 12-14 range, but I may be thinking of when
>>> I was pushing throughput.
>>>
>>> Ariel
>>>
>>> On Tue, Feb 14, 2012 at 4:02 PM, Flavio Junqueira wrote:
>>>
>>>> Some of our previous measurements gave us around 5ms, check some of our
>>>> presentations we uploaded to the wiki. Those use 7.2k RPM disks and not
>>>> only volatile storage or battery backed cache. We do have the write cache
>>>> on for the numbers I'm referring to. There are also numbers there when the
>>>> write cache is off.
>>>>
>>>> -Flavio
>>>>
>>>> On Feb 14, 2012, at 9:48 PM, Ariel Weisberg wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It's only a minute if you process each region serially. Process 100 or 1000
>>>>> in parallel and it will go a lot faster.
>>>>>
>>>>> 20 milliseconds to synchronously commit to a 5.4k disk is about right. This
>>>>> is assuming the configuration for this is correct. On ext3 you need to
>>>>> mount with barrier=1 (ext4, xfs enable write barriers by default). If
>>>>> someone is getting significantly faster numbers they are probably writing
>>>>> to a volatile or battery backed cache.
>>>>>
>>>>> Performance is relative. The number of operations the DB can do is roughly
>>>>> constant although multi may be able to more efficiently batch operations by
>>>>> amortizing all the coordination overhead.
>>>>>
>>>>> In the synchronous case the DB is starved for work 99% of the time so it is
>>>>> not surprising that it is slow. You are benchmarking round trip time in
>>>>> that case, and that is dominated by the time it takes to synchronously
>>>>> commit something to disk.
>>>>>
>>>>> In the asynchronous case there is plenty of work and you can fully utilize
>>>>> all the throughput available to get it done because each fsync makes
>>>>> multiple operations durable. However the work is still presented piecemeal
>>>>> so there is per-operation overhead.
>>>>>
>>>>> Caveat, I am on 3.3.3 so I haven't read how multi operations are
>>>>> implemented, but the numbers you are getting bear this out. In the
>>>>> multi-case you are getting the benefit of keeping the DB fully utilized
>>>>> plus amortizing the coordination overhead across multiple operations so you
>>>>> get a boost in throughput beyond just async.
>>>>>
>>>>> Ariel
>>>>>
>>>>> On Tue, Feb 14, 2012 at 3:37 PM, N Keywal wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for the replies.
>>>>>>
>>>>>> It's used when assigning the regions (kind of dataset) to the regionserver
>>>>>> (jvm process in a physical server). There is one zookeeper node per region.
>>>>>> On a server failure, there are typically a few hundred regions to reassign,
>>>>>> with multiple status written in . On paper, if we need 0.02 s per node, that
>>>>>> adds up to about a minute to recover, just for zookeeper.
>>>>>>
>>>>>> That's theory. I haven't done a precise measurement yet.
>>>>>>
>>>>>> Anyway, if ZooKeeper can be faster, it's always very interesting :-)
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> N.
>>>>>>
>>>>>> On Tue, Feb 14, 2012 at 8:00 PM, Ted Dunning wrote:
>>>>>>
>>>>>>> These results are about what is expected although they might be a little
>>>>>>> more extreme.
>>>>>>>
>>>>>>> I doubt very much that hbase is mutating zk nodes fast enough for this to
>>>>>>> matter much.
>>>>>>>
>>>>>>> On Feb 14, 2012, at 8:00, N Keywal wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've done a test with Zookeeper 3.4.2 to compare the performance of
>>>>>>>> synchronous vs. asynchronous vs. multi when creating znodes (variations
>>>>>>>> around:
>>>>>>>> calling 10000 times zk.create("/dummyTest", "dummy".getBytes(),
>>>>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);) The code is at the
>>>>>>>> end of the mail.
>>>>>>>>
>>>>>>>> I've tested different environments:
>>>>>>>> - 1 linux server with the client and 1 zookeeper node on the same machine
>>>>>>>> - 1 linux server for the client, 1 for 1 zookeeper node.
>>>>>>>> - 6 linux servers, 1 for the client, 5 for 5 zookeeper nodes.
>>>>>>>>
>>>>>>>> Servers are mid-range, with 4*2 cores, jdk 1.6. ZK was on its own HD.
>>>>>>>>
>>>>>>>> But the results are comparable:
>>>>>>>>
>>>>>>>> Using the sync API, it takes 200 seconds for 10K creations, so around 0.02
>>>>>>>> second per call.
>>>>>>>> Using the async API, it takes 2 seconds for 10K (including waiting for the
>>>>>>>> last callback message)
>>>>>>>> Using the "multi" available since 3.4, it takes less than 1 second, again
>>>>>>>> for 10K.
>>>>>>>>
>>>>>>>> I'm surprised by the time taken by the sync operation, I was not expecting
>>>>>>>> it to be that slow. The gap between async & sync is quite huge.
>>>>>>>>
>>>>>>>> Is this something expected? Zookeeper is used in critical functions in
>>>>>>>> Hadoop/Hbase, I was looking at the possible benefits of using "multi", but
>>>>>>>> it seems low compared to async (well ~3 times faster :-). There are many
>>>>>>>> small data creations/deletions with the sync API in the existing hbase
>>>>>>>> algorithms, it would not be simple to replace them all by asynchronous
>>>>>>>> calls...
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> N.
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> import java.lang.reflect.InvocationTargetException;
>>>>>>>> import java.lang.reflect.Method;
>>>>>>>> import java.util.ArrayList;
>>>>>>>> import java.util.concurrent.atomic.AtomicInteger;
>>>>>>>>
>>>>>>>> import org.apache.zookeeper.AsyncCallback;
>>>>>>>> import org.apache.zookeeper.CreateMode;
>>>>>>>> import org.apache.zookeeper.Op;
>>>>>>>> import org.apache.zookeeper.WatchedEvent;
>>>>>>>> import org.apache.zookeeper.Watcher;
>>>>>>>> import org.apache.zookeeper.ZooDefs;
>>>>>>>> import org.apache.zookeeper.ZooKeeper;
>>>>>>>>
>>>>>>>> public class ZookeeperTest {
>>>>>>>>   static ZooKeeper zk;
>>>>>>>>   static int nbTests = 10000;
>>>>>>>>
>>>>>>>>   private ZookeeperTest() {
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Synchronous API: each create blocks until the server has replied.
>>>>>>>>   public static void test11() throws Exception {
>>>>>>>>     for (int i = 0; i < nbTests; ++i) {
>>>>>>>>       zk.create("/dummyTest_" + i, "dummy".getBytes(),
>>>>>>>>           ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Asynchronous API: queue every create, then wait for all callbacks.
>>>>>>>>   // The result code is not checked, so failed creates are counted too.
>>>>>>>>   public static void test51() throws Exception {
>>>>>>>>     final AtomicInteger counter = new AtomicInteger(0);
>>>>>>>>
>>>>>>>>     for (int i = 0; i < nbTests; ++i) {
>>>>>>>>       zk.create("/dummyTest_" + i, "dummy".getBytes(),
>>>>>>>>           ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
>>>>>>>>           new AsyncCallback.StringCallback() {
>>>>>>>>             public void processResult(int rc, String path, Object ctx, String name) {
>>>>>>>>               counter.incrementAndGet();
>>>>>>>>             }
>>>>>>>>           }
>>>>>>>>           , null);
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     while (counter.get() != nbTests) {
>>>>>>>>       Thread.sleep(1);
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Multi API (3.4+): send all creates as a single request.
>>>>>>>>   public static void test41() throws Exception {
>>>>>>>>     ArrayList<Op> ops = new ArrayList<Op>(nbTests);
>>>>>>>>     for (int i = 0; i < nbTests; ++i) {
>>>>>>>>       ops.add(
>>>>>>>>           Op.create("/dummyTest_" + i, "dummy".getBytes(),
>>>>>>>>               ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
>>>>>>>>       );
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     zk.multi(ops);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Cleanup: delete all test znodes in one multi request (-1 = any version).
>>>>>>>>   public static void delete() throws Exception {
>>>>>>>>     ArrayList<Op> ops = new ArrayList<Op>(nbTests);
>>>>>>>>
>>>>>>>>     for (int i = 0; i < nbTests; ++i) {
>>>>>>>>       ops.add(
>>>>>>>>           Op.delete("/dummyTest_" + i, -1)
>>>>>>>>       );
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     zk.multi(ops);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Connects, runs the named test method by reflection and prints its duration.
>>>>>>>>   public static void test(String connection, String testName) throws
>>>>>>>>       Throwable {
>>>>>>>>     Method m = ZookeeperTest.class.getMethod(testName);
>>>>>>>>
>>>>>>>>     zk = new ZooKeeper(connection, 20000, new Watcher() {
>>>>>>>>       public void process(WatchedEvent watchedEvent) {
>>>>>>>>       }
>>>>>>>>     });
>>>>>>>>
>>>>>>>>     final long start = System.currentTimeMillis();
>>>>>>>>
>>>>>>>>     try {
>>>>>>>>       m.invoke(null);
>>>>>>>>     } catch (IllegalAccessException e) {
>>>>>>>>       throw e;
>>>>>>>>     } catch (InvocationTargetException e) {
>>>>>>>>       throw e.getTargetException();
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     final long end = System.currentTimeMillis();
>>>>>>>>
>>>>>>>>     zk.close();
>>>>>>>>
>>>>>>>>     final long endClose = System.currentTimeMillis();
>>>>>>>>
>>>>>>>>     System.out.println(testName + ": ExeTime= " + (end - start));
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // args[0]: connection string, args[1]: name of the test method to run
>>>>>>>>   public static void main(String... args) throws Throwable {
>>>>>>>>     test(args[0], args[1]);
>>>>>>>>   }
>>>>>>>> }
>>>>>>>
>>>>>>
>>>>
>>>> flavio
>>>> junqueira
>>>>
>>>> research scientist
>>>>
>>>> fpj@yahoo-inc.com
>>>> direct +34 93-183-8828
>>>>
>>>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>>>> phone (408) 349 3300 fax (408) 349 3301
>>>>
>>
>> flavio
>> junqueira
>>
>> research scientist
>>
>> fpj@yahoo-inc.com
>> direct +34 93-183-8828
>>
>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>> phone (408) 349 3300 fax (408) 349 3301
>>

flavio
junqueira

research scientist

fpj@yahoo-inc.com
direct +34 93-183-8828

avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300 fax (408) 349 3301

--Apple-Mail-6--845030409--
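
The per-call figures quoted in this thread come down to simple arithmetic, which the sketch below spells out. It is a back-of-the-envelope model, not a benchmark and not anything taken from ZooKeeper itself. The class and method names are made up for illustration. The region count (300) and the number of writes per region (10) are hypothetical values chosen only to land near the "about a minute" estimate in the thread. The parallel case assumes perfect overlap of in-flight writes, which a real ensemble is unlikely to deliver.

```java
// Back-of-the-envelope model of the figures quoted in this thread.
// All names are illustrative; region and write counts are hypothetical.
class ZkLatencyModel {

    /** Average time per operation, in milliseconds. */
    static double perOpMillis(double totalSeconds, int operations) {
        return totalSeconds * 1000.0 / operations;
    }

    /**
     * Idealized time, in seconds, to complete `writes` synchronous writes
     * when `parallelism` of them are in flight at once. Assumes perfect
     * overlap, so it is a lower bound rather than a prediction.
     */
    static double recoverySeconds(int writes, double perWriteSeconds, int parallelism) {
        int rounds = (writes + parallelism - 1) / parallelism; // ceiling division
        return rounds * perWriteSeconds;
    }

    public static void main(String[] args) {
        int ops = 10_000; // number of creates in the reported test

        // Totals reported in the thread: 200 s (sync) and 2 s (async).
        System.out.printf("sync:  %.2f ms per create%n", perOpMillis(200, ops));
        System.out.printf("async: %.2f ms per create%n", perOpMillis(2, ops));

        int regions = 300;        // hypothetical: "a few hundred regions"
        int writesPerRegion = 10; // hypothetical: "multiple status written"
        int writes = regions * writesPerRegion;

        // 0.02 s per synchronous write, as measured in the thread.
        System.out.printf("serial recovery:  %.1f s%n", recoverySeconds(writes, 0.02, 1));
        System.out.printf("100-way parallel: %.1f s%n", recoverySeconds(writes, 0.02, 100));
    }
}
```

The model only restates the reasoning in the thread: the synchronous total is the number of calls multiplied by the round-trip time, so the ways to shrink it are to overlap calls (async, or many callers in parallel) or to send fewer requests (multi).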