hbase-user mailing list archives

From Zheng Lv <lvzheng19800...@gmail.com>
Subject Re: is there any problem with our environment?
Date Mon, 23 Nov 2009 06:30:52 GMT
Hello Stack,

>How did you fix it?
We found some "link down" and "link up" in some server's syslog, our system
manager suggested to change the network cards on those servers, so we
changed them, and the errors disappeared.


>So figure out whats happening to that region by grepping its name in the
>master log.  Why is it offline so long?  Are machines loaded?  Swapping?

>Yeah, its split.  Thats normal.  Whats not normal is the client not finding
>the daughter split in its new location.  Did the daughters get deployed
>promptly?
I'm sorry, I don't know exactly how I can get this information from the log.
Can you tell me? I have uploaded the logs to SkyDrive:
http://cid-a331bb289a14fbef.skydrive.live.com/browse.aspx/.Public/1123?uc=1&isFromRichUpload=1&lc=2052.
There are two master logs, which are from different days, and two
regionserver logs, which are from the first two servers that shut down.
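So far all we have tried is pulling out the lines that mention the region with
a little helper like the sketch below (just a rough sketch; the log path and
the region-name fragment are placeholders for our setup):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Rough sketch: print every master-log line that mentions one region.
    // The log path and the region-name fragment below are placeholders.
    public class GrepRegion {
      public static void main(String[] args) throws IOException {
        String logPath = "/path/to/hbase-master.log";   // placeholder path
        String regionFragment = ",1258691377544";       // distinctive part of the region name
        BufferedReader in = new BufferedReader(new FileReader(logPath));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            if (line.contains(regionFragment)) {
              System.out.println(line);
            }
          }
        } finally {
          in.close();
        }
      }
    }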


>
>Are the crawlers running on same machines as hbase?
Yes, they are. The cluster we are using is set up like this:
1 master/namenode/jobtracker/crawler server.
6 rs/zk/datanode/tasktracker/crawler clients.

>
>What crawler are you using?
The crawler we are using was developed by ourselves; it is composed of a
server and several clients, communicating via RMI.
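Roughly, the clients pull work from the server over an RMI interface like the
sketch below (purely illustrative; the interface and method names here are
made up, not our real code):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;

    // Purely illustrative sketch -- not our real code; the names are made up.
    // The crawler server hands out batches of URLs over RMI, and the clients
    // report fetched pages back, which are then inserted into the webpage table.
    public interface CrawlerService extends Remote {
      // A client asks the server for its next batch of URLs to fetch.
      List<String> nextUrls(String clientId, int batchSize) throws RemoteException;

      // A client reports one fetched page back to the server.
      void reportPage(String url, byte[] content) throws RemoteException;
    }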

>
>Andrew Purtell has written up some notes on getting a nice balance between
>crawl process and hbase such that all runs smoothly in private
>correspondence.  Let me ask him if its ok to forward the list.
 If so, we will be very happy and thankful.

>Can you update to hbase 0.20.2?  It has a bunch of fixes that could be
>related to the above.
We have updated to hbase 0.20.2 this morning and will re-run the crawler
shortly; we will share the result with you.
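For reference, the write path in each crawler client is basically a plain
HTable.put, roughly like the sketch below (simplified; the column
family/qualifier and the retry values are placeholders only, the "webpage"
table name is ours). If we should also be tuning client-side retry settings
such as hbase.client.retries.number or hbase.client.pause while regions are
splitting, please let us know:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Simplified sketch of how one crawler client writes a fetched page.
    // The column family/qualifier and the retry values are placeholders only.
    public class PageWriter {
      private final HTable table;

      public PageWriter() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        // Client-side retry knobs we are asking about -- example values only.
        conf.setInt("hbase.client.retries.number", 10);
        conf.setInt("hbase.client.pause", 1000);
        table = new HTable(conf, "webpage");
      }

      public void write(String url, byte[] content) throws IOException {
        Put put = new Put(Bytes.toBytes(url));   // row key is the page URL
        put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), content);
        table.put(put);   // the client retries internally on NotServingRegionException
      }
    }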

Regards,
LvZheng.


2009/11/21 stack <stack@duboce.net>

> On Fri, Nov 20, 2009 at 12:28 AM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
>
> > Hello Stack,
> > Remember the "no route to host" exceptions last time? Now there isn't any
> > more, and the test program can be running for several days.
>
>
> How did you fix it?
>
>
>
> > Thank you.
> > Recently we started running our crawling program, which crawls webpages and
> > then insert them to hbase.
> > But we got so many "org.apache.hadoop.hbase.NotServingRegionException" like
> > that:
> >
> > 2009-11-20 12:36:41,898 ERROR
> > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > org.apache.hadoop.hbase.NotServingRegionException: webpage,
> > http://bbs.city.tianya.cn/tianyacity/Content/178/1/536629.shtml,1258691377544
> >
>
> So figure out whats happening to that region by grepping its name in the
> master log.  Why is it offline so long?  Are machines loaded?  Swapping?
>
> Are the crawlers running on same machines as hbase?
>
> What crawler are you using?
>
> Andrew Purtell has written up some notes on getting a nice balance between
> crawl process and hbase such that all runs smoothly in private
> correspondence.  Let me ask him if its ok to forward the list.
>
>
> ....
>
> > 2009-11-20 12:36:25,259 INFO org.apache.hadoop.hbase.master.ServerManager:
> > Processing MSG_REPORT_SPLIT:
> > webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F178\x2F1\x2F536629.shtml,1258691377544:
> > Daughters; webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F178\x2F1\x2F536629.shtml,1258691779496,
> > webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F329\x2F1\x2F164370.shtml,1258691779496
> > from ubuntu12,60020,1258687326554;
> >
> Yeah, its split.  Thats normal.  Whats not normal is the client not finding
> the daughter split in its new location.  Did the daughters get deployed
> promptly?
>
>
>
> > And a few hours later, some rs shutdown.
> >
> > I read the mail
> > http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200907.mbox/%3C9b27a8a60907272122y1bfa6254n95948942d5ca7f88@mail.gmail.com%3E,
> > which was sent by my partner Angus. In the mail you told us it was a case of
> > "HBASE-1671", Fix Version of which is 0.20.0, but the hbase version we are
> > using is just 0.20.0.
> >
>
> Can you update to hbase 0.20.2?  It has a bunch of fixes that could be
> related to the above.
> Yours,
> St.Ack
>
>
>
> > Any idea?
> > Best Regards,
> > LvZheng
> >
> >
> >
> >
> >
> > 2009/10/13 stack <stack@duboce.net>
> >
> > > Thanks for posting.  Its much easier reading the logs from there.
> > >
> > > Looking in nohup.out I see it can't find region 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F080\x2F0\x2F5FOO155J0001124J.html1255072992000_751685,1255316061169'.
> > > It never finds it.   It looks like it was assigned successfully to
> > > 192.168.33.5 going by the master log.  Once you've figured out the
> > > hardware/networking issues, lets work at getting that region back on line.
> > >
> > > The master timed out its session against zk because of 'no route to host'.
> > >
> > > St.Ack
> > >
> > > On Mon, Oct 12, 2009 at 12:23 AM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> > >
> > > > Hello Stack,
> > > >    I have enabled DEBUG and restarted the test program. This time the
> > > > master shut down, and I have put the logs on skydrive.
> > > >
> > > >
> > > > http://cid-a331bb289a14fbef.skydrive.live.com/browse.aspx/.Public?uc=2&isFromRichUpload=1.
> > > >    "nohup.out" is our test program log,
> "hbase-cyd-master-ubuntu6.log"
> > is
> > > > master log.
> > > >
> > > >    On the other hand, today we found that when we run "dmesg", there were
> > > > many logs like "[3641697.122769] r8169: eth0: link down". And I think this
> > > > might be the reason of so many "no route to host" and "Time Out". Now our
> > > > system manager is checking, if we have a result we will let you know.:)
> > > >    Thanks,
> > > >    LvZheng.
> > > >
> > > > 2009/10/11 stack <stack@duboce.net>
> > > >
> > > > > On Fri, Oct 9, 2009 at 3:18 AM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> > > > >
> > > > > > ...
> > > > > > so,
> > > > > >    > please remove the delay so hbase fails faster so it doesn't
> > > > > >    > take so long to figure the issue.
> > > > > >    > Are you inserting every 10ms because hbase is falling over on
> > > > > >    > you?  If
> > > > > >    Yes I inserted every 10ms because I'm afraid hbase would fall
> > > > > > over. Now I have removed the delay.
> > > > > >
> > > > > >    After doing these, We have run the test program again, and one
> > > > > > region server shut down after about 2 hours, another one 3.
> > > > > >    I will post the logs on these two servers in following reply
> > > > > > mails.
> > > > > >
> > > > > >
> > > > > Thanks for doing the above.
> > > > >
> > > > > For the future, debugging, please enable DEBUG and put your logs somewhere
> > > > > where I can pull them or put them up in pastebin.  Logs in email messages
> > > > > are hard to follow.  Thanks.
> > > > >
> > > > >
> > > > > >    > Ok.  So this is hbase 0.20.0?  Tell us about your hardware.  What
> > > > > >    > kind is it?  CPU/RAM/Disks.
> > > > > >     Yes we are using  hbase 0.20.0. And the following is our hardware:
> > > > > >
> > > > > >    CPU:amd x3 710
> > > > > >    RAM:8g ddr2 800
> > > > > >    Disk:270g(raid0)
> > > > > >
> > > > > >
> > > > > Thats an interesting chip -- 3 cores!  The above should be fine as long as
> > > > > you coral your mapreduce jobs running on same cluster.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > >    We have 7 servers with above hardware, one for master, three for
> > > > > > namenodes / regionservers, and the other 3 for zks.
> > > > > >    By the way, what kind of hardware and environment do you suggest we
> > > > > > have?
> > > > > >
> > > > >
> > > > >
> > > > > This configuration seems fine to start with.  Later we might experiment
> > > > > running zk on same machines as regionservers and then up number of
> > > > > regionservers to 6 and up the quorum members to 5.
> > > > >
> > > > > St.Ack
> > > > >
> > > > >
> > > > > >
> > > > > >    Thank you, very much.
> > > > > >    LvZheng.
> > > > > >
> > > > >
> > > >
> > >
> >
>
