zookeeper-user mailing list archives

From: Flavio Junqueira <...@yahoo-inc.com>
Subject: Re: sync vs. async vs. multi performances
Date: Sun, 19 Feb 2012 11:04:47 GMT
Hi Ariel, Here is what they mean:

	Net means the overhead of the replication protocol only, without writing to disk
	Net+disk means the overhead of the replication protocol with writes to disk enabled
	Net+disk (no write cache) is the same as the previous one, but with the write cache of the disk turned off
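
The difference between the last two is whether the drive's write cache absorbs the flush. To get a rough feel for the disk component in isolation, here is a minimal, unscientific sketch (file name, payload size and iteration count are arbitrary) that times 1 KB appends with and without forcing them to the device:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Times n 1 KB appends, first without any flush (buffered in the page cache),
// then calling force(false) after each write so the data is pushed to the device.
public class FsyncLatency {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("fsync-test", ".bin");
        FileChannel ch = new RandomAccessFile(f, "rw").getChannel();
        ByteBuffer payload = ByteBuffer.wrap(new byte[1024]);
        int n = 1000;

        for (boolean force : new boolean[]{false, true}) {
            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                payload.rewind();
                ch.write(payload);
                if (force) {
                    ch.force(false); // flush file data (not metadata) to the device
                }
            }
            System.out.println((force ? "forced" : "buffered") + ": "
                    + (System.currentTimeMillis() - start) / (double) n + " ms/write");
        }
        ch.close();
        f.delete();
    }
}

If the drive's write cache is on, the forced numbers will still look optimistic; turning that cache off is exactly what separates the second configuration from the third.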

-Flavio
 
On Feb 18, 2012, at 4:17 PM, Ariel Weisberg wrote:

> Hi,
> 
> In that diagram, what is the difference between net, net + disk, and net +
> disk (no write cache)?
> 
> Thanks,
> Ariel
> 
> On Fri, Feb 17, 2012 at 3:41 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
> 
>> Hi Ariel, That wiki is stale. Check it here:
>> 
>> 
>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeperPresentations
>> 
>> In particular check the HIC talk, slide 57. We were using 1k byte writes
>> for those tests.
>> 
>> -Flavio
>> 
>> On Feb 15, 2012, at 12:18 AM, Ariel Weisberg wrote:
>> 
>>> Hi,
>>> 
>>> I tried to look at the presentations on the wiki, but the links aren't
>>> working? I was using
>>> http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations and the
>>> error at the top of the page is "You are not allowed to do AttachFile on
>>> this page. Login and try again."
>>> 
>>> I used (http://pastebin.com/uu7igM3J) and the results for 4k writes were
>>> http://pastebin.com/N26CJtQE. 8.5 milliseconds, which is a bit slower than
>>> 5. Is it possible to beat the rotation speed?
>>> 
>>> You can increase the write size quite a bit to 240k and it only goes up to
>>> 10 milliseconds. http://pastebin.com/MSTwaHYN
>>> 
>>> My recollection was of being in the 12-14 range, but I may be thinking of
>>> when I was pushing throughput.
>>> 
>>> Ariel
>>> 
>>> On Tue, Feb 14, 2012 at 4:02 PM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
>>> 
>>>> Some of our previous measurements gave us around 5ms; check some of the
>>>> presentations we uploaded to the wiki. Those use 7.2k RPM disks, not
>>>> only volatile storage or battery-backed cache. We do have the write cache
>>>> on for the numbers I'm referring to. There are also numbers there for when
>>>> the write cache is off.
>>>> 
>>>> -Flavio
>>>> 
>>>> On Feb 14, 2012, at 9:48 PM, Ariel Weisberg wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> It's only a minute if you process each region serially. Process 100 or 1000
>>>>> in parallel and it will go a lot faster.
>>>>> 
>>>>> 20 milliseconds to synchronously commit to a 5.4k RPM disk is about right.
>>>>> This is assuming the configuration is correct. On ext3 you need to mount
>>>>> with barrier=1 (ext4 and xfs enable write barriers by default). If someone
>>>>> is getting significantly faster numbers they are probably writing to a
>>>>> volatile or battery-backed cache.
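
For reference, one way to see which options a filesystem is actually mounted with on Linux is to read /proc/mounts; a minimal sketch (the class name and filesystem filter are arbitrary choices), which only reports the mount options and does not prove the drive honours flushes:

import java.io.BufferedReader;
import java.io.FileReader;

// Prints mount point, filesystem type and mount options for ext3/ext4/xfs
// entries in /proc/mounts, so you can check e.g. for barrier=1 on ext3.
public class MountOptionsCheck {
    public static void main(String[] args) throws Exception {
        BufferedReader r = new BufferedReader(new FileReader("/proc/mounts"));
        try {
            String line;
            while ((line = r.readLine()) != null) {
                String[] f = line.split("\\s+"); // device, mount point, type, options, ...
                if (f.length >= 4 && (f[2].equals("ext3") || f[2].equals("ext4") || f[2].equals("xfs"))) {
                    System.out.println(f[1] + " [" + f[2] + "] options: " + f[3]);
                }
            }
        } finally {
            r.close();
        }
    }
}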
>>>>> 
>>>>> Performance is relative. The number of operations the DB can do is roughly
>>>>> constant, although multi may be able to batch operations more efficiently
>>>>> by amortizing all the coordination overhead.
>>>>> 
>>>>> In the synchronous case the DB is starved for work 99% of the time, so it is
>>>>> not surprising that it is slow. You are benchmarking round-trip time in
>>>>> that case, and that is dominated by the time it takes to synchronously
>>>>> commit something to disk.
>>>>> 
>>>>> In the asynchronous case there is plenty of work and you can fully utilize
>>>>> all the throughput available to get it done, because each fsync makes
>>>>> multiple operations durable. However, the work is still presented piecemeal,
>>>>> so there is per-operation overhead.
>>>>> 
>>>>> Caveat: I am on 3.3.3 so I haven't read how multi operations are
>>>>> implemented, but the numbers you are getting bear this out. In the
>>>>> multi case you get the benefit of keeping the DB fully utilized
>>>>> plus amortizing the coordination overhead across multiple operations, so
>>>>> you get a boost in throughput beyond just async.
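
Rough arithmetic with the numbers from this thread: at ~0.02 s per synchronous create, 10,000 creates cost 10,000 x 0.02 s = 200 s, which matches the sync result quoted below; with async or multi, one fsync covers many creates, so the per-operation disk cost mostly disappears and only the per-operation CPU/network overhead remains.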
>>>>> 
>>>>> Ariel
>>>>> 
>>>>> On Tue, Feb 14, 2012 at 3:37 PM, N Keywal <nkeywal@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Thanks for the replies.
>>>>>> 
>>>>>> It's used when assigning the regions (a kind of dataset) to the regionserver
>>>>>> (a JVM process on a physical server). There is one zookeeper node per region.
>>>>>> On a server failure, there are typically a few hundred regions to reassign,
>>>>>> with multiple statuses written. On paper, if we need 0.02s per node, that
>>>>>> adds up to around a minute to recover, just for zookeeper.
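
(For example, with 300 regions and 10 znode writes each, that is 300 x 10 x 0.02 s = 60 s spent in ZooKeeper alone; the 300 and 10 are only illustrative, the thread just says "a few hundred" regions and "multiple" statuses.)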
>>>>>> 
>>>>>> That's theory. I haven't done a precise measurement yet.
>>>>>> 
>>>>>> 
>>>>>> Anyway, if ZooKeeper can be faster, it's always very interesting :-)
>>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> N.
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 14, 2012 at 8:00 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>> 
>>>>>>> These results are about what is expected, although they might be a little
>>>>>>> more extreme.
>>>>>>> 
>>>>>>> I doubt very much that hbase is mutating zk nodes fast enough for this to
>>>>>>> matter much.
>>>>>>> 
>>>>>>> Sent from my iPhone
>>>>>>> 
>>>>>>> On Feb 14, 2012, at 8:00, N Keywal <nkeywal@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I've done a test with ZooKeeper 3.4.2 to compare the performance of
>>>>>>>> synchronous vs. asynchronous vs. multi when creating znodes (variations
>>>>>>>> around calling zk.create("/dummyTest", "dummy".getBytes(),
>>>>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT); 10000 times).
>>>>>>>> The code is at the end of the mail.
>>>>>>>> 
>>>>>>>> I've tested different environments:
>>>>>>>> - 1 linux server with the client and 1 zookeeper node on the same machine
>>>>>>>> - 1 linux server for the client, 1 for 1 zookeeper node.
>>>>>>>> - 6 linux servers, 1 for the client, 5 for 5 zookeeper nodes.
>>>>>>>> 
>>>>>>>> Servers are mid-range, with 4*2 cores, jdk 1.6. ZK was on its own HD.
>>>>>>>> 
>>>>>>>> But the results are comparable:
>>>>>>>> 
>>>>>>>> Using the sync API, it takes 200 seconds for 10K creations, so around
>>>>>>>> 0.02 seconds per call.
>>>>>>>> Using the async API, it takes 2 seconds for 10K (including waiting for
>>>>>>>> the last callback message).
>>>>>>>> Using the "multi" API available since 3.4, it takes less than 1 second,
>>>>>>>> again for 10K.
>>>>>>>> 
>>>>>>>> I'm surprised by the time taken by the sync operation; I was not expecting
>>>>>>>> it to be that slow. The gap between async & sync is quite huge.
>>>>>>>> 
>>>>>>>> Is this something expected? ZooKeeper is used in critical functions in
>>>>>>>> Hadoop/HBase. I was looking at the possible benefits of using "multi",
>>>>>>>> but the gain seems low compared to async (well, ~3 times faster :-).
>>>>>>>> There are many small data creations/deletions with the sync API in the
>>>>>>>> existing HBase algorithms; it would not be simple to replace them all
>>>>>>>> by asynchronous calls...
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> N.
>>>>>>>> 
>>>>>>>> --
>>>>>>>> 
>>>>>>>> import java.lang.reflect.InvocationTargetException;
>>>>>>>> import java.lang.reflect.Method;
>>>>>>>> import java.util.ArrayList;
>>>>>>>> import java.util.concurrent.atomic.AtomicInteger;
>>>>>>>> import org.apache.zookeeper.AsyncCallback;
>>>>>>>> import org.apache.zookeeper.CreateMode;
>>>>>>>> import org.apache.zookeeper.Op;
>>>>>>>> import org.apache.zookeeper.WatchedEvent;
>>>>>>>> import org.apache.zookeeper.Watcher;
>>>>>>>> import org.apache.zookeeper.ZooDefs;
>>>>>>>> import org.apache.zookeeper.ZooKeeper;
>>>>>>>> 
>>>>>>>> public class ZookeeperTest {
>>>>>>>> static ZooKeeper zk;
>>>>>>>> static int nbTests = 10000;
>>>>>>>> 
>>>>>>>> private ZookeeperTest() {
>>>>>>>> }
>>>>>>>> 
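>>>>>>>> // test11: synchronous creates, one blocking zk.create() call per znode.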
>>>>>>>> public static void test11() throws Exception {
>>>>>>>> for (int i = 0; i < nbTests; ++i) {
>>>>>>>>   zk.create("/dummyTest_" + i, "dummy".getBytes(),
>>>>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>>>>>>>> }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> 
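>>>>>>>> // test51: asynchronous creates; the callback counts completions and the
>>>>>>>> // while loop at the end waits until every callback has fired.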
>>>>>>>> public static void test51() throws Exception {
>>>>>>>> final AtomicInteger counter = new AtomicInteger(0);
>>>>>>>> 
>>>>>>>> for (int i = 0; i < nbTests; ++i) {
>>>>>>>>   zk.create("/dummyTest_" + i, "dummy".getBytes(),
>>>>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
>>>>>>>>     new AsyncCallback.StringCallback() {
>>>>>>>>       public void processResult(int i, String s, Object o, String s1) {
>>>>>>>>         counter.incrementAndGet();
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>     , null);
>>>>>>>> }
>>>>>>>> 
>>>>>>>> while (counter.get() != nbTests) {
>>>>>>>>   Thread.sleep(1);
>>>>>>>> }
>>>>>>>> }
>>>>>>>> 
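>>>>>>>> // test41: builds all the create operations and submits them in a single multi() call.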
>>>>>>>> public static void test41() throws Exception {
>>>>>>>> ArrayList<Op> ops = new ArrayList<Op>(nbTests);
>>>>>>>> for (int i = 0; i < nbTests; ++i) {
>>>>>>>>   ops.add(
>>>>>>>>     Op.create("/dummyTest_" + i, "dummy".getBytes(),
>>>>>>>> ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
>>>>>>>>   );
>>>>>>>> }
>>>>>>>> 
>>>>>>>> zk.multi(ops);
>>>>>>>> }
>>>>>>>> 
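>>>>>>>> // delete: removes the test znodes with a single multi() of delete ops.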
>>>>>>>> public static void delete() throws Exception {
>>>>>>>> ArrayList<Op> ops = new ArrayList<Op>(nbTests);
>>>>>>>> 
>>>>>>>> for (int i = 0; i < nbTests; ++i) {
>>>>>>>>   ops.add(
>>>>>>>>     Op.delete("/dummyTest_" + i,-1)
>>>>>>>>   );
>>>>>>>> }
>>>>>>>> 
>>>>>>>> zk.multi(ops);
>>>>>>>> }
>>>>>>>> 
>>>>>>>> 
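>>>>>>>> // Connects to ZooKeeper, runs the named test via reflection, prints the
>>>>>>>> // elapsed time and closes the session.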
>>>>>>>> public static void test(String connection, String testName) throws Throwable {
>>>>>>>> Method m = ZookeeperTest.class.getMethod(testName);
>>>>>>>> 
>>>>>>>> zk = new ZooKeeper(connection, 20000, new Watcher() {
>>>>>>>>   public void process(WatchedEvent watchedEvent) {
>>>>>>>>   }
>>>>>>>> });
>>>>>>>> 
>>>>>>>> final long start = System.currentTimeMillis();
>>>>>>>> 
>>>>>>>> try {
>>>>>>>>   m.invoke(null);
>>>>>>>> } catch (IllegalAccessException e) {
>>>>>>>>   throw e;
>>>>>>>> } catch (InvocationTargetException e) {
>>>>>>>>   throw e.getTargetException();
>>>>>>>> }
>>>>>>>> 
>>>>>>>> final long end = System.currentTimeMillis();
>>>>>>>> 
>>>>>>>> zk.close();
>>>>>>>> 
>>>>>>>> final long endClose = System.currentTimeMillis();
>>>>>>>> 
>>>>>>>> System.out.println(testName + ":  ExeTime= " + (end - start) + ", CloseTime= " + (endClose - end));
>>>>>>>> }
>>>>>>>> 
>>>>>>>> public static void main(String... args) throws Throwable {
>>>>>>>>   test(args[0], args[1]);
>>>>>>>> }
>>>>>>>> }
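
(The class above is driven from main as "ZookeeperTest <connection> <testName>", e.g. java -cp <zookeeper jars> ZookeeperTest localhost:2181 test51; the host is whatever your ensemble uses, 2181 being ZooKeeper's default client port.)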
>>>>>>> 
>>>>>> 
>>>> 
>> 

flavio
junqueira
 
research scientist
 
fpj@yahoo-inc.com
direct +34 93-183-8828
 
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301

