Subject: Re: is there any problem with our environment?
From: Zheng Lv
To: hbase-user@hadoop.apache.org
Date: Wed, 25 Nov 2009 12:22:04 +0800

Hello Stack,

>Sorry for taking time getting back to you Lv.
Never mind :).

>I posted Andrew's crawling notes yesterday. Did you see them? I thought
>they might be of use to you.
You mean you've sent a mail to me? I didn't get it.

>This is interesting. You were just looking at message logs and saw the
>above?
We checked /var/log/messages on some servers, and we found many "link down"
and "link up" entries.

>You don't need 6 zk nodes. Make it 3 (In an earlier mail you said you had 3
>only but above would seem to say you have 6 zk nodes). You could up the
>ticktime from 2 to 3 seconds (We learned that maximum session time is
>20*ticktime no matter what session timeout is set to).
We will change our cluster as you told me.

>(In an earlier mail you said you had 3 only but above would seem to say you
>have 6 zk nodes)
We added some; we thought 3 was insufficient.

>For 8Gs of RAM and a single disk might not be sufficient to carry all of the
>above daemons on a single node.
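The zookeeper tuning quoted above (a 3-member quorum, tickTime raised from 2 to
3 seconds, and the 20 * tickTime cap on session time) could be sketched as a
zoo.cfg fragment. This is only an illustration; the hostnames and dataDir are
invented, not taken from this cluster:

```
# zoo.cfg sketch of the advice above; hostnames and paths are invented.
# tickTime is in milliseconds: 3000 = 3 seconds.
# ZooKeeper caps the session timeout at 20 * tickTime, so this allows
# sessions up to 60 seconds no matter what timeout the client asks for.
tickTime=3000
dataDir=/var/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
# three quorum members instead of six
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```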
>You have cluster-wide monitoring setup?
>Are these nodes swapping? From the logs I've seen, HDFS is for sure
>struggling.
Our system manager has just deployed mrtg+snmp on those servers, and we will
share the data with you.

>The crawler writes straight into hbase or does it write the local disk?
Each crawler has an in-memory cache; when it is full, the crawler flushes the
cached webpages to hbase.

>What is the tasktracker doing? Are mapreduce jobs running
>concurrently on these nodes?
The tasktracker process is running now, but there are no jobs on it. When the
updater, which is another part of our product, is turned on, there will be.

We will restart hbase 0.20.2 with DEBUG enabled and monitor it. I'll be back
when we get useful data.

Regards,
Lv Zheng.

2009/11/25 stack

> I took a look at your logs. Please run with DEBUG enabled going forward (if
> you upgraded to 0.20.2, then it'll be on by default).
>
> HDFS is struggling.
>
> Looking at this region:
>
> http://blog.blog.tianya.cn/blogger/post_Date.asp?blogID=1818563&CategoryID=1256944&idWriter=0&Key=0&NextPostID=888888888&PageNo=2
>
> ... there seems to be an issue w/ accounting. It looks like it got assigned
> to 13 before the logs you gave me but then the regionserver its on, while it
> thinks its carrying it in one regard -- its trying to flush it so it can let
> go of old WAL files -- it then is telling clients that ask for it that its
> not serving it. Seeing an earlier master and regionserver 13 log would help
> figure what happened. Try restart on 0.20.2. Tell me more about the
> loading on these machines, how its done. Do you have monitoring software
> running?
>
> Thanks,
> St.Ack
>
> On Tue, Nov 24, 2009 at 10:30 AM, stack wrote:
>
> > Sorry for taking time getting back to you Lv.
> >
> > I posted Andrew's crawling notes yesterday. Did you see them? I thought
> > they might be of use to you.
> >
> > On Sun, Nov 22, 2009 at 10:30 PM, Zheng Lv wrote:
> >
> >> >How did you fix it?
> >> We found some "link down" and "link up" entries in some servers' syslog.
> >> Our system manager suggested changing the network cards on those servers,
> >> so we changed them, and the errors disappeared.
> >
> > This is interesting. You were just looking at message logs and saw the
> > above?
> >
> >> >So figure out whats happening to that region by grepping its name in the
> >> >master log. Why is it offline so long? Are machines loaded? Swapping?
> >>
> >> >Yeah, its split. Thats normal. Whats not normal is the client not
> >> >finding the daughter split in its new location. Did the daughters get
> >> >deployed promptly?
> >> I'm sorry, I don't know exactly how I can get this information from the
> >> log. Can you tell me? And I have uploaded the logs to SkyDrive:
> >> http://cid-a331bb289a14fbef.skydrive.live.com/browse.aspx/.Public/1123?uc=1&isFromRichUpload=1&lc=2052
> >> There are two master logs, which are from different days, and two
> >> regionserver logs, which are from the first 2 shutdown servers.
> >
> > I'm looking at these logs now.
> >
> >> >Are the crawlers running on same machines as hbase?
> >> Yes, they are. And the cluster we are using is like that:
> >> 1 master/namenode/jobtracker/crawler server.
> >> 6 rs/zk/datanode/tasktracker/crawler client.
> >
> > You don't need 6 zk nodes. Make it 3 (In an earlier mail you said you had
> > 3 only but above would seem to say you have 6 zk nodes). You could up the
> > ticktime from 2 to 3 seconds (We learned that maximum session time is
> > 20*ticktime no matter what session timeout is set to).
> >
> > For 8Gs of RAM and a single disk might not be sufficient to carry all of
> > the above daemons on a single node. You have cluster-wide monitoring
> > setup? Are these nodes swapping? From the logs I've seen, HDFS is for
> > sure struggling. The crawler writes straight into hbase or does it write
> > the local disk?
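The crawler behavior Lv describes, an in-memory cache that gets flushed to
hbase when full, is a plain write-buffer pattern. The sketch below is generic
Python, not the actual crawler or HBase client code; `flush_fn` stands in for
whatever call writes a batch to hbase:

```python
# Generic write-buffer sketch of the crawler's cache-then-flush behavior.
# flush_fn is a stand-in for the real "write batch to hbase" call; it is
# an assumption for illustration, not the actual crawler API.

class BufferedWriter:
    def __init__(self, flush_fn, capacity=100):
        self.flush_fn = flush_fn  # called with the list of buffered pages
        self.capacity = capacity  # flush once this many pages are cached
        self.buffer = []

    def add(self, url, page):
        self.buffer.append((url, page))
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one batched write, not many small ones
            self.buffer = []

# usage with a stand-in sink instead of a real hbase table
written = []
w = BufferedWriter(written.extend, capacity=2)
w.add("http://example.com/a", "<html>a</html>")
w.add("http://example.com/b", "<html>b</html>")  # hits capacity, flushes
w.add("http://example.com/c", "<html>c</html>")
w.flush()                                        # drain the remainder
print(len(written))  # → 3
```

Batching like this is also why the crawler load shows up in bursts on the
regionservers: each flush is one larger write instead of a steady trickle.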
> > What is the tasktracker doing? Are mapreduce jobs running
> > concurrently on these nodes?
> >
> >> >Can you update to hbase 0.20.2? It has a bunch of fixes that could be
> >> >related to the above.
> >> We have updated to hbase 0.20.2 this morning, and we will re-run the
> >> crawler in a minute, and we will share the result with you.
> >
> > Good stuff.
> >
> > I'll be back after some study of your logs.
> > St.Ack
> >
> >> Regards,
> >> LvZheng.
> >>
> >> 2009/11/21 stack
> >>
> >> > On Fri, Nov 20, 2009 at 12:28 AM, Zheng Lv wrote:
> >> >
> >> > > Hello Stack,
> >> > > Remember the "no route to host" exceptions last time? Now there isn't
> >> > > any more, and the test program has been running for several days.
> >> >
> >> > How did you fix it?
> >> >
> >> > > Thank you.
> >> > > Recently we started running our crawling program, which crawls
> >> > > webpages and then inserts them into hbase.
> >> > > But we got so many "org.apache.hadoop.hbase.NotServingRegionException"
> >> > > like that:
> >> > >
> >> > > 2009-11-20 12:36:41,898 ERROR
> >> > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > > org.apache.hadoop.hbase.NotServingRegionException: webpage,
> >> > > http://bbs.city.tianya.cn/tianyacity/Content/178/1/536629.shtml,1258691377544
> >> >
> >> > So figure out whats happening to that region by grepping its name in the
> >> > master log. Why is it offline so long? Are machines loaded? Swapping?
> >> >
> >> > Are the crawlers running on same machines as hbase?
> >> >
> >> > What crawler are you using?
> >> >
> >> > Andrew Purtell has written up some notes on getting a nice balance
> >> > between crawl process and hbase such that all runs smoothly in private
> >> > correspondence. Let me ask him if its ok to forward the list.
> >> >
> >> > ....
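Stack's suggestion to grep the region name in the master log can be sketched
in shell. The log file and its contents below are fabricated, loosely modeled
on the MSG_REPORT_SPLIT line quoted in this thread, purely to show the
mechanics:

```shell
# Fabricated master-log sample; the lines are made up for illustration.
cat > /tmp/hbase-master-example.log <<'EOF'
2009-11-20 12:36:25,259 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT: webpage,http:\x2F\x2Fexample.com\x2Fpage.shtml,1258691377544: Daughters; webpage,http:\x2F\x2Fexample.com\x2Fpage.shtml,1258691779496 from ubuntu12,60020,1258687326554;
2009-11-20 12:41:02,101 INFO unrelated.Component: some other line
EOF

# -F treats the region name as a fixed string, which matters because the
# escaped names contain backslashes like \x2F that plain grep would
# interpret as pattern syntax.
grep -F 'webpage,http:\x2F\x2Fexample.com\x2Fpage.shtml' /tmp/hbase-master-example.log
```

Following a real region's assignment history is the same idea: grep the
escaped region name through the master log and read the matches in time order.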
> >> >
> >> > > 2009-11-20 12:36:25,259 INFO
> >> > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > Processing MSG_REPORT_SPLIT:
> >> > > webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F178\x2F1\x2F536629.shtml,1258691377544:
> >> > > Daughters;
> >> > > webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F178\x2F1\x2F536629.shtml,1258691779496,
> >> > > webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F329\x2F1\x2F164370.shtml,1258691779496
> >> > > from ubuntu12,60020,1258687326554;
> >> >
> >> > Yeah, its split. Thats normal. Whats not normal is the client not
> >> > finding the daughter split in its new location. Did the daughters get
> >> > deployed promptly?
> >> >
> >> > > And a few hours later, some rs shut down.
> >> > >
> >> > > I read the mail
> >> > > http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200907.mbox/%3C9b27a8a60907272122y1bfa6254n95948942d5ca7f88@mail.gmail.com%3E
> >> > > which was sent by my partner Angus. In the mail you told us it was a
> >> > > case of "HBASE-1671", whose Fix Version is 0.20.0, but the hbase
> >> > > version we are using is just 0.20.0.
> >> >
> >> > Can you update to hbase 0.20.2? It has a bunch of fixes that could be
> >> > related to the above.
> >> > Yours,
> >> > St.Ack
> >> >
> >> > > Any idea?
> >> > > Best Regards,
> >> > > LvZheng
> >> > >
> >> > > 2009/10/13 stack
> >> > >
> >> > > > Thanks for posting. Its much easier reading the logs from there.
> >> > > > Looking in nohup.out I see it can't find region
> >> > > > 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F080\x2F0\x2F5FOO155J0001124J.html1255072992000_751685,1255316061169'.
> >> > > > It never finds it. It looks like it was assigned successfully to
> >> > > > 192.168.33.5 going by the master log. Once you've figured out the
> >> > > > hardware/networking issues, lets work at getting that region back
> >> > > > online.
> >> > > >
> >> > > > The master timed out its session against zk because of 'no route
> >> > > > to host'.
> >> > > >
> >> > > > St.Ack
> >> > > >
> >> > > > On Mon, Oct 12, 2009 at 12:23 AM, Zheng Lv
> >> > > > <lvzheng19800619@gmail.com> wrote:
> >> > > >
> >> > > > > Hello Stack,
> >> > > > > I have enabled DEBUG and restarted the test program. This time
> >> > > > > the master shut down, and I have put the logs on SkyDrive:
> >> > > > > http://cid-a331bb289a14fbef.skydrive.live.com/browse.aspx/.Public?uc=2&isFromRichUpload=1
> >> > > > > "nohup.out" is our test program log, and
> >> > > > > "hbase-cyd-master-ubuntu6.log" is the master log.
> >> > > > >
> >> > > > > On the other hand, today we found that when we ran "dmesg", there
> >> > > > > were many logs like "[3641697.122769] r8169: eth0: link down".
> >> > > > > And I think this might be the reason for so many "no route to
> >> > > > > host" and "Time Out" errors. Now our system manager is checking;
> >> > > > > if we have a result we will let you know. :)
> >> > > > > Thanks,
> >> > > > > LvZheng.
> >> > > > >
> >> > > > > 2009/10/11 stack
> >> > > > >
> >> > > > > > On Fri, Oct 9, 2009 at 3:18 AM, Zheng Lv
> >> > > > > > <lvzheng19800619@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > ...
> >> > > > > > > > so, please remove the delay so hbase fails faster so it
> >> > > > > > > > doesn't take so long to figure the issue.
> >> > > > > > > > Are you inserting every 10ms because hbase is falling over
> >> > > > > > > > on you? If
> >> > > > > > > Yes, I inserted every 10ms because I was afraid hbase would
> >> > > > > > > fall over. Now I have removed the delay.
> >> > > > > > >
> >> > > > > > > After doing these, we have run the test program again, and
> >> > > > > > > one region server shut down after about 2 hours, another one
> >> > > > > > > after 3. I will post the logs of these two servers in
> >> > > > > > > following reply mails.
> >> > > > > >
> >> > > > > > Thanks for doing the above.
> >> > > > > >
> >> > > > > > For future debugging, please enable DEBUG and put your logs
> >> > > > > > somewhere where I can pull them or put them up in pastebin.
> >> > > > > > Logs in email messages are hard to follow. Thanks.
> >> > > > > >
> >> > > > > > > > Ok. So this is hbase 0.20.0? Tell us about your hardware.
> >> > > > > > > > What kind is it? CPU/RAM/Disks.
> >> > > > > > > Yes, we are using hbase 0.20.0. And the following is our
> >> > > > > > > hardware:
> >> > > > > > >
> >> > > > > > > CPU: AMD X3 710
> >> > > > > > > RAM: 8G DDR2 800
> >> > > > > > > Disk: 270G (RAID0)
> >> > > > > >
> >> > > > > > Thats an interesting chip -- 3 cores! The above should be fine
> >> > > > > > as long as you corral your mapreduce jobs running on the same
> >> > > > > > cluster.
> >> > > > > >
> >> > > > > > > We have 7 servers with the above hardware: one for master,
> >> > > > > > > three for namenodes / regionservers, and the other 3 for zks.
> >> > > > > > > By the way, what kind of hardware and environment do you
> >> > > > > > > suggest we have?
> >> > > > > >
> >> > > > > > This configuration seems fine to start with. Later we might
> >> > > > > > experiment running zk on the same machines as regionservers,
> >> > > > > > then up the number of regionservers to 6 and the quorum
> >> > > > > > members to 5.
> >> > > > > >
> >> > > > > > St.Ack
> >> > > > > >
> >> > > > > > > Thank you, very much.
> >> > > > > > > LvZheng.