Subject: Re: Serious problem processing hearbeat on login stampede
From: Chang Song
To: Ted Dunning
Cc: user@zookeeper.apache.org, Benjamin Reed, Patrick Hunt, zookeeper-user@hadoop.apache.org
Date: Sat, 16 Apr 2011 15:25:17 +0900
Message-id: <79019AFE-1940-48E5-9924-885291AD2C2C@me.com>

On Apr 16, 2011, at 2:21 PM, Ted Dunning wrote:

> You know, I think it would help if you would answer some of the questions that people have posed.
>
> You say that it takes 1000 clients over 8 seconds to register. That is about 100 transactions per second.

Ted, sorry. The scenario that actually reproduces the problem isn't the one I described initially.
It is not login; it is the session expiration and close process.
I know. We have tested ZK many times at rates well above this in our environment and saw no problems.
Sorry about the confusion.

> That is two orders of magnitude slower than others have observed ZK to be. This is a really big difference.
>
> So there is a big discrepancy here. I am not saying you didn't observe what you say, but I do think that there is something that you haven't mentioned because you haven't noticed it yet. If you go through the questions people have asked and answer them, there is a good chance you will notice something that is causing your problems. There is likely to be a problem in the way that you have set up your machines.
>
> One pending question is whether you have separate log and snapshot disks. Do you?

I have already answered this: I have no separate disk.
We have one filesystem mount point on RAID1 disks.

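As a quick check of the arithmetic in the estimate quoted above, the following Python sketch computes the registration rate from the two figures given in the thread (1000 clients, 8 seconds). The function name is made up for illustration, and the "two orders of magnitude" step simply multiplies by 100; it is the rate implied by that statement, not a measured ZooKeeper figure.

```python
def registrations_per_second(clients: int, seconds: float) -> float:
    """Average rate when `clients` registrations complete in `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return clients / seconds


# Figures quoted in the thread: 1000 clients, 8 seconds.
observed = registrations_per_second(1000, 8)  # 125.0, i.e. "about 100" per second

# "Two orders of magnitude" is a factor of 100. This is only the rate implied
# by that statement, not a benchmark result.
implied = observed * 100  # 12500.0

print(f"observed: {observed:.0f}/s, implied typical rate: {implied:.0f}/s")
```
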
> Another is whether you have other processes running on the disk. Are there?

Our ZK ensemble servers are dedicated to the ZK ensemble only.

> Another is a request that you post some of the output of iostat with 5 second sampling rate. Can you post that output?

I will, though not until Monday.
But please note that I used to be a kernel filesystem engineer, and I know how to read iostat ;)

> There are other questions that you will find in the email history.
>
> Remember, people answering your questions here are doing so because they are nice and because they like to build a sense of community. But to get a lot from them, you need to work with them.

Please let me know if there are questions still to be answered.
I will try to update JIRA with the answers from these emails.

Thank you.

>
> 2011/4/15 Chang Song
>
> I have filed a JIRA bug:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-1049
>
> We have measured I/O wait again, but found no I/O activity due to ZK.
> Just the regular page cache sync daemon at work: 0-3%.
>
> I will have my team attach the ZK stat results.
>
> Thanks a lot.
> Let's move this discussion to JIRA.
>
> On Apr 15, 2011, at 7:34 AM, Ted Dunning wrote:
>
> > You said that, but there was some skepticism from others about this.
> >
> > You need to try the monitoring that was suggested. 5 minute averages are
> > not useful.
> >
> > What does the stat four letter command return? (
> > http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkCommands )
> >
> > 2011/4/14 Chang Song
> >
> >> 2. We have a boot disk and usr disk.
> >> But as I said, disk I/O is not an issue that's causing 8 second delay.
> >>
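
The stat four-letter command asked about in the quoted exchange can be issued by opening a TCP connection to a server's client port, writing the four ASCII letters, and reading until the server closes the connection. A minimal Python sketch follows. The function names and the address in the usage comment (`localhost`, port 2181) are illustrative assumptions, and the parser only collects generic `Key: value` lines rather than relying on any guaranteed output format.

```python
import socket


def four_letter_word(host: str, port: int, cmd: str = "stat",
                     timeout: float = 5.0) -> str:
    """Send a four-letter command to a ZooKeeper server and return its reply."""
    if len(cmd) != 4:
        raise ValueError("ZooKeeper admin commands are exactly four letters")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_key_values(reply: str) -> dict:
    """Collect 'Key: value' lines from a reply; all other lines are skipped."""
    fields = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Usage (hypothetical address; point it at a real ensemble member):
#   reply = four_letter_word("localhost", 2181, "stat")
#   print(parse_key_values(reply))
```

Polling this every few seconds on each ensemble member, rather than relying on 5 minute averages, is one way to collect the kind of short-interval data requested in the thread.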