Subject: Re: Low write bandwidth
From: Maciej Smoleński <jezdnia@gmail.com>
Date: Wed, 10 Jun 2015 20:07:08 +0200
To: user@bookkeeper.apache.org, robindh@apache.org
Cc: Flavio Junqueira <fpjunqueira@yahoo.com>

Thank you for your comments and explanations.

If I understand correctly, processing an entry as a unit might take longer, since it requires waiting for all TCP fragments that carry a single entry, and the write is performed only after that.
It looks like this introduces extra latency and might be the reason the bandwidth is not saturated.
I will try to confirm this.

Kind regards,
Maciej

On Wed, Jun 10, 2015 at 6:09 PM, Robin Dhamankar <robindh@apache.org> wrote:
> Flavio, that's right, we don't stream entries, so the full entry is
> processed as a unit. As such, the network transfer of an entry and its
> write to disk don't overlap.
>
> Essentially, throughput here is bounded by the latency of each add request.
> As such, for a fixed entry size we cannot necessarily saturate I/O
> bandwidth (network or disk bandwidth) unless we can lower latencies.
>
> Maciej, multiple packets can be in flight at the same time, so the latency
> is not strictly additive. The 1.39 ms gives you a rough upper bound on the
> network latency, and then there is the request processing latency on the
> bookie (as these two phases do not overlap, as I explained above). For
> ramfs, request processing should have low latency. We should probably
> measure that as well.
>
> Thanks-
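
A rough back-of-the-envelope check of that bound, using only numbers already
reported in this thread (100K entries, ~250 synchronous adds/s, ~400 MB/s of
measured link bandwidth):

    per-add latency                   ≈ 1 s / 250 adds            ≈ 4 ms
    single-writer throughput          ≈ 100 KB / 4 ms             ≈ 25 MB/s
    adds in flight to fill the link   ≈ 400 MB/s * 4 ms / 100 KB  ≈ 16

So with a single outstanding synchronous add, the observed 25 MB/s is what the
per-add latency alone predicts, independent of link capacity; on the order of
16 concurrent adds would be needed to approach the measured 400 MB/s.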
> On Wed, Jun 10, 2015 at 8:38 AM, Maciej Smoleński <jezdnia@gmail.com> wrote:
>
>> I run ping -s 65000 and the results are below.
>> Latency is always < 1.5 ms.
>> Does it mean that for transporting a single entry two packets will be used,
>> and the latency will be 2.5 ms (1.5 ms for 65K and 1 ms for 35K => 2.5 ms
>> for 100K)?
>> Is it possible to improve this? Is it possible to increase the packet size,
>> so that a single entry fits in a single packet?
>>
>> ping/from_client_to_server1
>> PING SN0101 (169.254.1.31) 65000(65028) bytes of data.
>> 65008 bytes from SN0101 (169.254.1.31): icmp_seq=1 ttl=64 time=1.39 ms
>> 65008 bytes from SN0101 (169.254.1.31): icmp_seq=2 ttl=64 time=1.29 ms
>> 65008 bytes from SN0101 (169.254.1.31): icmp_seq=3 ttl=64 time=1.29 ms
>> 65008 bytes from SN0101 (169.254.1.31): icmp_seq=4 ttl=64 time=1.31 ms
>> 65008 bytes from SN0101 (169.254.1.31): icmp_seq=5 ttl=64 time=1.32 ms
>>
>> ping/from_client_to_server2
>> PING SN0102 (169.254.1.32) 65000(65028) bytes of data.
>> 65008 bytes from SN0102 (169.254.1.32): icmp_seq=1 ttl=64 time=1.26 ms
>> 65008 bytes from SN0102 (169.254.1.32): icmp_seq=2 ttl=64 time=1.31 ms
>> 65008 bytes from SN0102 (169.254.1.32): icmp_seq=3 ttl=64 time=1.12 ms
>> 65008 bytes from SN0102 (169.254.1.32): icmp_seq=4 ttl=64 time=1.27 ms
>> 65008 bytes from SN0102 (169.254.1.32): icmp_seq=5 ttl=64 time=1.37 ms
>>
>> ping/from_client_to_server3
>> PING SN0103 (169.254.1.33) 65000(65028) bytes of data.
>> 65008 bytes from SN0103 (169.254.1.33): icmp_seq=1 ttl=64 time=1.25 ms
>> 65008 bytes from SN0103 (169.254.1.33): icmp_seq=2 ttl=64 time=1.38 ms
>> 65008 bytes from SN0103 (169.254.1.33): icmp_seq=3 ttl=64 time=1.25 ms
>> 65008 bytes from SN0103 (169.254.1.33): icmp_seq=4 ttl=64 time=1.33 ms
>> 65008 bytes from SN0103 (169.254.1.33): icmp_seq=5 ttl=64 time=1.32 ms
>>
>> ping/from_server1_to_client
>> PING AN0101 (169.254.1.11) 65000(65028) bytes of data.
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=1 ttl=64 time=1.01 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=2 ttl=64 time=1.38 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=3 ttl=64 time=1.35 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=4 ttl=64 time=1.35 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=5 ttl=64 time=1.32 ms
>>
>> ping/from_server2_to_client
>> PING AN0101 (169.254.1.11) 65000(65028) bytes of data.
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=1 ttl=64 time=0.887 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=2 ttl=64 time=1.31 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=3 ttl=64 time=1.32 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=4 ttl=64 time=0.998 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=5 ttl=64 time=1.22 ms
>>
>> ping/from_server3_to_client
>> PING AN0101 (169.254.1.11) 65000(65028) bytes of data.
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=1 ttl=64 time=1.08 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=2 ttl=64 time=1.40 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=3 ttl=64 time=1.07 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=4 ttl=64 time=1.26 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=5 ttl=64 time=1.26 ms
>> 65008 bytes from AN0101 (169.254.1.11): icmp_seq=6 ttl=64 time=1.26 ms
>>
>> On Wed, Jun 10, 2015 at 4:45 PM, Aniruddha Laud <trojan.of.troy@gmail.com> wrote:
>>
>>> On Wed, Jun 10, 2015 at 7:00 AM, Maciej Smoleński <jezdnia@gmail.com> wrote:
>>>
>>>> Thank you for your comment.
>>>>
>>>> Unfortunately, these options will not help in my case.
>>>> In my case the BookKeeper client will receive the next request only when
>>>> the previous request has been confirmed.
>>>> It is also expected that there will be only a single stream of such
>>>> requests.
>>>>
>>>> I would like to understand how to achieve performance equal to the
>>>> network bandwidth.
>>>
>>> To saturate bandwidth, you will have to have more than one outstanding
>>> request. 250 requests/second gives you 4 ms per request. With each entry
>>> 100K in size, that's not unreasonable. My suggestion would be to monitor
>>> the write latency from the client to the server.
>>>
>>> ping -s 65000 should give you a baseline for what to expect with latencies.
>>>
>>> With 100K packets, you are going to see fragmentation at both the IP and
>>> the Ethernet layer. That wasn't the case with the 1K payload.
>>>
>>> How many hops does one need to go from one machine to another? The more
>>> hops, the higher the latency.
>>>
>>>> On Wed, Jun 10, 2015 at 2:27 PM, Flavio Junqueira <fpjunqueira@yahoo.com> wrote:
>>>>
>>>>> BK currently isn't wired to stream bytes to a ledger, so writing large
>>>>> entries synchronously as you're doing is likely not to get the best out
>>>>> of its performance. A couple of things you could try to get higher
>>>>> performance are to write asynchronously and to have multiple clients
>>>>> writing.
>>>>>
>>>>> -Flavio
>>>>>
>>>>> On Wednesday, June 10, 2015 12:08 PM, Maciej Smoleński <jezdnia@gmail.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I'm testing BK performance when appending 100K entries synchronously
>>>>> from 1 thread (using one ledger).
>>>>> The performance I get is 250 entries/s.
>>>>>
>>>>> What performance should I expect?
>>>>> My setup:
>>>>>
>>>>> Ledger:
>>>>> Ensemble size: 3
>>>>> Quorum size: 2
>>>>>
>>>>> 1 client machine and 3 server machines.
>>>>>
>>>>> Network:
>>>>> Each machine with bonding: 4 x 1000 Mbps
>>>>> manually tested between client and server: 400 MB/s
>>>>>
>>>>> Disk:
>>>>> I tested two configurations:
>>>>> dedicated disks with ext3 (different for zookeeper, journal, data,
>>>>> index, log)
>>>>> dedicated ramfs partitions (different for zookeeper, journal, data,
>>>>> index, log)
>>>>>
>>>>> In both configurations the performance is the same: 250 entries/s
>>>>> (25 MB/s).
>>>>> I confirmed this with the measured network bandwidth:
>>>>> - on client: 50 MB/s
>>>>> - on server: 17 MB/s
>>>>>
>>>>> I ran Java with a profiler enabled on the BK client and BK server but
>>>>> didn't find anything unexpected (but I don't know the bookkeeper
>>>>> internals).
>>>>>
>>>>> I tested it with two BookKeeper versions:
>>>>> - 4.3.0
>>>>> - 4.2.2
>>>>> The results were the same with both BookKeeper versions.
>>>>>
>>>>> What should be changed/checked to get better performance?
>>>>>
>>>>> Kind regards,
>>>>> Maciej
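
As a concrete illustration of the asynchronous, pipelined writing that Flavio
and Aniruddha suggest above, a minimal sketch against the public BookKeeper
4.2/4.3 client API might look like the code below. The ZooKeeper address,
ledger password, pipeline depth, and entry count here are made-up placeholders,
not values taken from this setup:

    import java.util.concurrent.Semaphore;

    import org.apache.bookkeeper.client.AsyncCallback.AddCallback;
    import org.apache.bookkeeper.client.BKException;
    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class PipelinedWriter {
        public static void main(String[] args) throws Exception {
            // Assumed ZooKeeper connect string -- replace with the real ensemble.
            BookKeeper bk = new BookKeeper("zk1:2181");
            // Ensemble size 3, quorum size 2, as in the test described above.
            LedgerHandle lh = bk.createLedger(3, 2,
                    BookKeeper.DigestType.MAC, "test-password".getBytes());

            // Allow up to 16 adds to be outstanding at once (placeholder depth).
            final Semaphore inFlight = new Semaphore(16);
            final byte[] entry = new byte[100 * 1024];  // ~100K entry, as in the benchmark

            final AddCallback cb = new AddCallback() {
                @Override
                public void addComplete(int rc, LedgerHandle handle, long entryId, Object ctx) {
                    if (rc != BKException.Code.OK) {
                        System.err.println("add of entry " + entryId + " failed, rc=" + rc);
                    }
                    inFlight.release();
                }
            };

            for (int i = 0; i < 10000; i++) {
                inFlight.acquire();                 // block once 16 adds are in flight
                lh.asyncAddEntry(entry, cb, null);
            }

            inFlight.acquire(16);                   // drain the pipeline before closing
            lh.close();
            bk.close();
        }
    }

The single-threaded, one-ledger shape of the original test is kept; only the
number of unacknowledged adds changes. Running several such writers, each on
its own ledger, would be the multi-client variant of the same idea.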