zookeeper-user mailing list archives

From Ariel Weisberg <aweisb...@voltdb.com>
Subject Re: sync vs. async vs. multi performances
Date Sat, 18 Feb 2012 15:17:47 GMT
Hi,

In that diagram, what is the difference between net, net + disk, and net +
disk (no write cache)?

Thanks,
Ariel

On Fri, Feb 17, 2012 at 3:41 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:

> Hi Ariel, that wiki is stale. Check it here:
>
>
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeperPresentations
>
> In particular check the HIC talk, slide 57. We were using 1k byte writes
> for those tests.
>
> -Flavio
>
> On Feb 15, 2012, at 12:18 AM, Ariel Weisberg wrote:
>
> > Hi,
> >
> > I tried to look at the presentations on the wiki, but the links aren't
> > working? I was using
> > http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations and the
> > error at the top of the page is "You are not allowed to do AttachFile on
> > this page. Login and try again."
> >
> > I used (http://pastebin.com/uu7igM3J) and the results for 4k writes were
> > http://pastebin.com/N26CJtQE. 8.5 milliseconds, which is a bit slower than
> > 5 ms. Is it possible to beat the rotation speed?
> >
> > You can increase the write size quite a bit to 240k and it only goes up to
> > 10 milliseconds. http://pastebin.com/MSTwaHYN
> >
> > My recollection was of being in the 12-14 ms range, but I may be thinking
> > of when I was pushing throughput.
> >
> > Ariel
> >
> > On Tue, Feb 14, 2012 at 4:02 PM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
> >
> >> Some of our previous measurements gave us around 5ms; check some of the
> >> presentations we uploaded to the wiki. Those use 7.2k RPM disks, not
> >> volatile storage or a battery-backed cache. We do have the write cache on
> >> for the numbers I'm referring to. There are also numbers there for when
> >> the write cache is off.
> >>
> >> -Flavio
> >>
> >> On Feb 14, 2012, at 9:48 PM, Ariel Weisberg wrote:
> >>
> >>> Hi,
> >>>
> >>> It's only a minute if you process each region serially. Process 100 or
> >>> 1000 in parallel and it will go a lot faster.
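> >>>
> >>> For example (an untested sketch, not from HBase; class and path names are
> >>> made up), you can bound the number of in-flight async creates with a
> >>> semaphore, so the server stays busy without queueing everything at once:
> >>>
> >>> import java.util.concurrent.Semaphore;
> >>> import org.apache.zookeeper.AsyncCallback;
> >>> import org.apache.zookeeper.CreateMode;
> >>> import org.apache.zookeeper.ZooDefs;
> >>> import org.apache.zookeeper.ZooKeeper;
> >>>
> >>> public class BoundedCreates {
> >>>   static void createAll(ZooKeeper zk, int n, int maxInFlight)
> >>>       throws InterruptedException {
> >>>     final Semaphore permits = new Semaphore(maxInFlight);
> >>>     for (int i = 0; i < n; ++i) {
> >>>       permits.acquire(); // blocks while maxInFlight requests are pending
> >>>       zk.create("/region_" + i, "state".getBytes(),
> >>>           ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
> >>>           new AsyncCallback.StringCallback() {
> >>>             public void processResult(int rc, String path, Object ctx,
> >>>                 String name) {
> >>>               permits.release(); // frees a slot when a create completes
> >>>             }
> >>>           }, null);
> >>>     }
> >>>     permits.acquire(maxInFlight); // drain: wait for the last callbacks
> >>>   }
> >>> }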
> >>>
> >>> 20 milliseconds to synchronously commit to a 5.4k RPM disk is about
> >>> right: at 5,400 RPM one rotation takes 60/5400 ≈ 11 ms, and a synchronous
> >>> commit typically waits on the order of a full rotation plus seek time.
> >>> This is assuming the configuration is correct. On ext3 you need to mount
> >>> with barrier=1 (ext4 and xfs enable write barriers by default). If
> >>> someone is getting significantly faster numbers they are probably writing
> >>> to a volatile or battery backed cache.
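> >>>
> >>> If you want to sanity-check that number on your own hardware, something
> >>> like this (an untested sketch; the file name and iteration count are made
> >>> up) measures the cost of a synchronous 4k commit, which is roughly the
> >>> price each ZK sync write pays:
> >>>
> >>> import java.io.FileOutputStream;
> >>> import java.nio.ByteBuffer;
> >>> import java.nio.channels.FileChannel;
> >>>
> >>> public class FsyncLatency {
> >>>   public static void main(String[] args) throws Exception {
> >>>     final int n = 100;
> >>>     FileOutputStream fos = new FileOutputStream("fsync-test.dat");
> >>>     FileChannel ch = fos.getChannel();
> >>>     ByteBuffer buf = ByteBuffer.allocate(4096); // 4k per write
> >>>     long start = System.currentTimeMillis();
> >>>     for (int i = 0; i < n; ++i) {
> >>>       buf.rewind();
> >>>       ch.write(buf);
> >>>       ch.force(false); // flush data to disk, like fdatasync
> >>>     }
> >>>     long elapsed = System.currentTimeMillis() - start;
> >>>     System.out.println("avg ms per commit: " + (double) elapsed / n);
> >>>     fos.close();
> >>>   }
> >>> }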
> >>>
> >>> Performance is relative. The number of operations the DB can do is
> >>> roughly constant, although multi may be able to batch operations more
> >>> efficiently by amortizing all the coordination overhead.
> >>>
> >>> In the synchronous case the DB is starved for work 99% of the time, so
> >>> it is not surprising that it is slow. You are benchmarking round-trip
> >>> time in that case, and that is dominated by the time it takes to
> >>> synchronously commit something to disk.
> >>>
> >>> In the asynchronous case there is plenty of work and you can fully
> >>> utilize all the available throughput because each fsync makes multiple
> >>> operations durable. However, the work is still presented piecemeal, so
> >>> there is per-operation overhead.
> >>>
> >>> Caveat: I am on 3.3.3, so I haven't read how multi operations are
> >>> implemented, but the numbers you are getting bear this out. In the multi
> >>> case you get the benefit of keeping the DB fully utilized, plus
> >>> amortizing the coordination overhead across multiple operations, so you
> >>> get a boost in throughput beyond just async.
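> >>>
> >>> One thing to watch for if you adopt multi (again, I haven't read the 3.4
> >>> code, so this is a guess): if the whole batch travels as a single
> >>> request, a very large batch could bump into the server's request size
> >>> limit (jute.maxbuffer, about 1MB by default). An untested sketch that
> >>> submits the ops in chunks, at the cost of losing atomicity across chunks:
> >>>
> >>> import java.util.List;
> >>> import org.apache.zookeeper.KeeperException;
> >>> import org.apache.zookeeper.Op;
> >>> import org.apache.zookeeper.ZooKeeper;
> >>>
> >>> public class ChunkedMulti {
> >>>   static void multiInChunks(ZooKeeper zk, List<Op> ops, int chunkSize)
> >>>       throws KeeperException, InterruptedException {
> >>>     for (int from = 0; from < ops.size(); from += chunkSize) {
> >>>       int to = Math.min(from + chunkSize, ops.size());
> >>>       zk.multi(ops.subList(from, to)); // each chunk is atomic on its own
> >>>     }
> >>>   }
> >>> }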
> >>>
> >>> Ariel
> >>>
> >>> On Tue, Feb 14, 2012 at 3:37 PM, N Keywal <nkeywal@gmail.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Thanks for the replies.
> >>>>
> >>>> It's used when assigning the regions (a kind of dataset) to the
> >>>> regionserver (a jvm process on a physical server). There is one
> >>>> zookeeper znode per region. On a server failure, there are typically a
> >>>> few hundred regions to reassign, with multiple status updates written to
> >>>> zookeeper for each. On paper, if we need 0.02s per write, that adds up
> >>>> to about a minute to recover, just for zookeeper (e.g. 300 regions x 10
> >>>> writes x 0.02s = 60s).
> >>>>
> >>>> That's theory. I haven't done a precise measurement yet.
> >>>>
> >>>>
> >>>> Anyway, if ZooKeeper can be faster, it's always very interesting :-)
> >>>>
> >>>>
> >>>> Cheers,
> >>>>
> >>>> N.
> >>>>
> >>>>
> >>>> On Tue, Feb 14, 2012 at 8:00 PM, Ted Dunning <ted.dunning@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> These results are about what is expected, although they might be a
> >>>>> little more extreme.
> >>>>>
> >>>>> I doubt very much that hbase is mutating zk nodes fast enough for
> >>>>> this to matter much.
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>> On Feb 14, 2012, at 8:00, N Keywal <nkeywal@gmail.com> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've done a test with Zookeeper 3.4.2 to compare the performances of
> >>>>>> synchronous vs. asynchronous vs. multi when creating znodes (variations
> >>>>>> around: calling 10000 times zk.create("/dummyTest", "dummy".getBytes(),
> >>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);). The code is at
> >>>>>> the end of the mail.
> >>>>>>
> >>>>>> I've tested different environments:
> >>>>>> - 1 linux server with the client and 1 zookeeper node on the same machine
> >>>>>> - 1 linux server for the client, 1 for 1 zookeeper node.
> >>>>>> - 6 linux servers, 1 for the client, 5 for 5 zookeeper nodes.
> >>>>>>
> >>>>>> Servers are mid-range, with 4*2 cores, jdk 1.6. ZK was on its own HD.
> >>>>>>
> >>>>>> But the results are comparable:
> >>>>>>
> >>>>>> Using the sync API, it takes 200 seconds for 10K creations, so around
> >>>>>> 0.02 second per call.
> >>>>>> Using the async API, it takes 2 seconds for 10K (including waiting
> >>>>>> for the last callback message).
> >>>>>> Using the "multi" available since 3.4, it takes less than 1
second,
> >>>> again
> >>>>>> for 10K.
> >>>>>>
> >>>>>> I'm surprised by the time taken by the sync operation; I was not
> >>>>>> expecting it to be that slow. The gap between async & sync is quite
> >>>>>> huge.
> >>>>>>
> >>>>>> Is this something expected? Zookeeper is used in critical functions
> >>>>>> in Hadoop/Hbase. I was looking at the possible benefits of using
> >>>>>> "multi", but the gain seems low compared to async (well, ~3 times
> >>>>>> faster :-). There are many small data creations/deletions with the
> >>>>>> sync API in the existing hbase algorithms; it would not be simple to
> >>>>>> replace them all by asynchronous calls...
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> N.
> >>>>>>
> >>>>>> --
> >>>>>>
> >>>>>> import java.lang.reflect.InvocationTargetException;
> >>>>>> import java.lang.reflect.Method;
> >>>>>> import java.util.ArrayList;
> >>>>>> import java.util.concurrent.atomic.AtomicInteger;
> >>>>>> import org.apache.zookeeper.*;
> >>>>>>
> >>>>>> public class ZookeeperTest {
> >>>>>> static ZooKeeper zk;
> >>>>>> static int nbTests = 10000;
> >>>>>>
> >>>>>> private ZookeeperTest() {
> >>>>>> }
> >>>>>>
> >>>>>> // Sync: one create per round trip; each call waits for the disk sync.
> >>>>>> public static void test11() throws Exception {
> >>>>>>  for (int i = 0; i < nbTests; ++i) {
> >>>>>>    zk.create("/dummyTest_" + i, "dummy".getBytes(),
> >>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
> >>>>>>  }
> >>>>>> }
> >>>>>>
> >>>>>>
> >>>>>> // Async: pipeline all creates, then wait for every callback.
> >>>>>> public static void test51() throws Exception {
> >>>>>>  final AtomicInteger counter = new AtomicInteger(0);
> >>>>>>
> >>>>>>  for (int i = 0; i < nbTests; ++i) {
> >>>>>>    zk.create("/dummyTest_" + i, "dummy".getBytes(),
> >>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
> >>>>>>      new AsyncCallback.StringCallback() {
> >>>>>>        public void processResult(int rc, String path, Object ctx,
> >>>>>>            String name) {
> >>>>>>          counter.incrementAndGet();
> >>>>>>        }
> >>>>>>      }
> >>>>>>      , null);
> >>>>>>  }
> >>>>>>
> >>>>>>  while (counter.get() != nbTests) {
> >>>>>>    Thread.sleep(1);
> >>>>>>  }
> >>>>>> }
> >>>>>>
> >>>>>> // Multi: batch all creates into a single request.
> >>>>>> public static void test41() throws Exception {
> >>>>>>  ArrayList<Op> ops = new ArrayList<Op>(nbTests);
> >>>>>>  for (int i = 0; i < nbTests; ++i) {
> >>>>>>    ops.add(
> >>>>>>      Op.create("/dummyTest_" + i, "dummy".getBytes(),
> >>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
> >>>>>>    );
> >>>>>>  }
> >>>>>>
> >>>>>>  zk.multi(ops);
> >>>>>> }
> >>>>>>
> >>>>>> // Cleanup: batch-delete the znodes created by the tests above.
> >>>>>> public static void delete() throws Exception {
> >>>>>>  ArrayList<Op> ops = new ArrayList<Op>(nbTests);
> >>>>>>
> >>>>>>  for (int i = 0; i < nbTests; ++i) {
> >>>>>>    ops.add(
> >>>>>>      Op.delete("/dummyTest_" + i,-1)
> >>>>>>    );
> >>>>>>  }
> >>>>>>
> >>>>>>  zk.multi(ops);
> >>>>>> }
> >>>>>>
> >>>>>>
> >>>>>> public static void test(String connection, String testName)
> >>>>>>     throws Throwable {
> >>>>>>  Method m = ZookeeperTest.class.getMethod(testName);
> >>>>>>
> >>>>>>  // 20s session timeout; the watcher ignores connection events.
> >>>>>>  zk = new ZooKeeper(connection, 20000, new Watcher() {
> >>>>>>    public void process(WatchedEvent watchedEvent) {
> >>>>>>    }
> >>>>>>  });
> >>>>>>
> >>>>>>  final long start = System.currentTimeMillis();
> >>>>>>
> >>>>>>  try {
> >>>>>>    m.invoke(null);
> >>>>>>  } catch (IllegalAccessException e) {
> >>>>>>    throw e;
> >>>>>>  } catch (InvocationTargetException e) {
> >>>>>>    throw e.getTargetException();
> >>>>>>  }
> >>>>>>
> >>>>>>  final long end = System.currentTimeMillis();
> >>>>>>
> >>>>>>  zk.close();
> >>>>>>
> >>>>>>  final long endClose = System.currentTimeMillis();
> >>>>>>
> >>>>>>  System.out.println(testName + ":  ExeTime= " + (end - start)
> >>>>>>      + "  CloseTime= " + (endClose - end));
> >>>>>> }
> >>>>>>
> >>>>>> public static void main(String... args) throws Throwable {
> >>>>>>    test(args[0], args[1]);
> >>>>>> }
> >>>>>> }
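> >>>>>>
> >>>>>> To run it, with the zookeeper jar on the classpath, e.g.:
> >>>>>>   java ZookeeperTest localhost:2181 test11
> >>>>>> (host:port is just an example; pass any connect string and one of the
> >>>>>> test method names above.)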
> >>>>>
> >>>>
> >>
> >>
>
> Flavio Junqueira
> Research Scientist
> fpj@yahoo-inc.com
> direct +34 93-183-8828
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300    fax (408) 349 3301
>
>
