hbase-user mailing list archives

From: Andrew Purtell <apurt...@apache.org>
Subject: Re: is there any problem with our environment?
Date: Wed, 25 Nov 2009 17:48:09 GMT
Try this: Do not cache on the crawlers, just write through.
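
For illustration, a minimal sketch of that write-through pattern on the crawler side, assuming
the 0.20-era Java client API -- the "webpage" table name is from your logs, but the
"content:raw" column here is made up for the example:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WriteThroughSink {
    private final HTable table;

    public WriteThroughSink() throws Exception {
      table = new HTable(new HBaseConfiguration(), "webpage");
      // Write through: send each Put to the regionserver immediately
      // instead of buffering pages in crawler memory.
      table.setAutoFlush(true);
    }

    public void store(String url, byte[] page) throws Exception {
      Put put = new Put(Bytes.toBytes(url));
      // Hypothetical column family "content", qualifier "raw".
      put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), page);
      table.put(put);
    }
  }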

Run each region server with plenty of heap (4 GB to start). It seems, then, that you need more
RAM in your systems, or you should move the crawlers off to separate servers to free up RAM
and CPU.

Adjust your HBase site config as follows:

  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>4</value>
  </property>

  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>20</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

  <property>
    <name>hfile.block.cache.size</name>
    <!-- disable -->
    <value>0</value>
  </property>

Be advised that the probability of OOM under high write load is high if you do not provide
adequate heap. This may manifest either as OOM exceptions or as GC taking so much time that
the ZK session is dropped. The concurrent garbage collector will keep GC latency low, but it
needs more available heap space in return. You should also tune HBASE_OPTS in
conf/hbase-env.sh according to this advice:
http://osdir.com/ml/hbase-user-hadoop-apache/2009-10/msg00276.html
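
For example, a sketch of what that could look like in conf/hbase-env.sh -- the heap size and
GC flags below are illustrative only, not a prescription; see the linked thread for the full
reasoning:

  # Region server heap, in MB (4 GB to start, per the above).
  export HBASE_HEAPSIZE=4000

  # Concurrent collector, plus GC logging so long pauses show up in the logs.
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"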

With your monitoring tools, watch the amount of disk wait time at the OS level. If disk wait
time exceeds 40%, you need more spindles (disks) for your datanodes or more datanodes
(servers) to spread the I/O load.
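
If that is not wired into your dashboards yet, a quick spot check from a shell on each
datanode works too, for example:

  # sysstat's iostat: extended device stats every 5 seconds. Watch %iowait on
  # the CPU line and %util per device.
  iostat -x 5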

   - Andy




________________________________
From: Zheng Lv <lvzheng19800619@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Tue, November 24, 2009 8:22:04 PM
Subject: Re: is there any problem with our environment?

Hello Stack,
>Sorry for taking time getting back to you Lv.
Never mind:).

>I posted Andrew's crawling notes yesterday. Did you see them?  I thought
>they might be of use to you.
You mean you've sent a mail to me? I didn't get it.

>This is interesting.  You were just looking at message logs and saw the
>above?
We checked /var/log/messages on some servers, and we got many "link down"
and "link up".

>You don't need 6 zk nodes.  Make it 3 (In an earlier mail you said you had 3
>only but above would seem to say you have 6 zk nodes).  You could up the
>ticktime from 2 to 3 seconds (We learned that maximum session time is
>20*ticktime no matter what session timeout is set to).
We will change our cluster as you told me.

>(In an earlier mail you said you had 3 only but above would seem to say you
>have 6 zk nodes)
We added some; we thought 3 was insufficient.

>For 8Gs of RAM and a single disk might not be sufficient to carry all of the
>above daemons on a single node.  You have cluster-wide monitoring setup?
>Are these nodes swapping?  From the logs I've seen, HDFS is for sure
>struggling.
Our system manager has just deployed mrtg+snmp on those servers, and we will
share the data with you.

>The crawler writes straight into hbase or does it write the
>local disk?
Each crawler has a cache in memory; when it is full, the crawler flushes the cached webpages
to hbase.
>What is the tasktracker doing?  Are mapreduce jobs running
>concurrently on these nodes?
Right now the tasktracker process is running, but there are no jobs on it. There will be when
the updater, which is another part of our product, is turned on.
We will restart hbase 0.20.2 with DEBUG enabled and monitor it; I'll be back when we get
useful data.
Regards,
Lv Zheng.







2009/11/25 stack <stack@duboce.net>

> I took a look at your logs.  Please run with DEBUG enabled going forward (if
> you upgraded to 0.20.2, then it'll be on by default).
>
> HDFS is struggling.
>
> Looking at this region:
>
>
> http://blog.blog.tianya.cn/blogger/post_Date.asp?blogID=1818563&CategoryID=1256944&idWriter=0&Key=0&NextPostID=888888888&PageNo=2
>
> ... there seems to be an issue w/ accounting.  It looks like it got assigned
> to 13 before the logs you gave me, but the regionserver it's on, while it
> thinks it's carrying it in one regard -- it's trying to flush it so it can let
> go of old WAL files -- is telling clients that ask for it that it's not
> serving it.  Seeing an earlier master and regionserver 13 log would help
> figure out what happened.  Try a restart on 0.20.2.  Tell me more about the
> loading on these machines and how it's done.  Do you have monitoring software
> running?
>
> Thanks,
> St.Ack
>
>
>
>
>
> On Tue, Nov 24, 2009 at 10:30 AM, stack <stack@duboce.net> wrote:
>
> > Sorry for taking time getting back to you Lv.
> >
> > I posted Andrew's crawling notes yesterday. Did you see them?  I thought
> > they might be of use to you.
> >
> >
> > On Sun, Nov 22, 2009 at 10:30 PM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> >
> >> >How did you fix it?
> >> We found some "link down" and "link up" in some server's syslog. Our system
> >> manager suggested changing the network cards on those servers, so we
> >> changed them, and the errors disappeared.
> >>
> >
> >
> > This is interesting.  You were just looking at message logs and saw the
> > above?
> >
> >
> >
> >>
> >>
> >> >So figure out what's happening to that region by grepping its name in the
> >> >master log.  Why is it offline so long?  Are machines loaded?  Swapping?
> >>
> >> >Yeah, it's split.  That's normal.  What's not normal is the client not
> >> >finding the daughter split in its new location.  Did the daughters get
> >> >deployed promptly?
> >> I'm sorry, I don't know exactly how I can get this information from the log.
> >> Can you tell me? And I have uploaded the logs to SkyDrive:
> >> http://cid-a331bb289a14fbef.skydrive.live.com/browse.aspx/.Public/1123?uc=1&isFromRichUpload=1&lc=2052
> >> There are two master logs, which are from different days, and two
> >> regionserver logs, which are from the first 2 servers that shut down.
> >>
> >>
> >  I'm looking at these logs now.
> >
> >
> >
> >>  >
> >> >Are the crawlers running on same machines as hbase?
> >> Yes, they are. And the cluster we are using looks like this:
> >> 1 master/namenode/jobtracker/crawler server.
> >> 6 rs/zk/datanode/tasktracker/crawler clients.
> >>
> >>
> > You don't need 6 zk nodes.  Make it 3 (In an earlier mail you said you had
> > 3 only but above would seem to say you have 6 zk nodes).  You could up the
> > ticktime from 2 to 3 seconds (We learned that maximum session time is
> > 20*ticktime no matter what session timeout is set to).
> >
> > For 8Gs of RAM and a single disk might not be sufficient to carry all of
> > the above daemons on a single node.  You have cluster-wide monitoring
> > setup?  Are these nodes swapping?  From the logs I've seen, HDFS is for sure
> > struggling.   The crawler writes straight into hbase or does it write the
> > local disk?  What is the tasktracker doing?  Are mapreduce jobs running
> > concurrently on these nodes?
> >
> >
> >> >Can you update to hbase 0.20.2?  It has a bunch of fixes that could be
> >> >related to the above.
> >> We have updated to hbase 0.20.2 this morning, and we will re-run the
> >> crawler
> >> in a minute, and we will share the result with you.
> >>
> >> Good stuff.
> >
> > I'll be back after some study of your logs.
> > St.Ack
> >
> >
> >
> >> Regards,
> >> LvZheng.
> >>
> >>
> >> 2009/11/21 stack <stack@duboce.net>
> >>
> >> > On Fri, Nov 20, 2009 at 12:28 AM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> >> >
> >> > > Hello Stack,
> >> > > Remember the "no route to host" exceptions last time? Now there aren't any
> >> > > more, and the test program can run for several days.
> >> >
> >> >
> >> > How did you fix it?
> >> >
> >> >
> >> >
> >> > > Thank you.
> >> > > Recently we started running our crawling program, which crawls webpages
> >> > > and then inserts them into hbase.
> >> > > But we got so many "org.apache.hadoop.hbase.NotServingRegionException"s,
> >> > > like this:
> >> > >
> >> > > 2009-11-20 12:36:41,898 ERROR
> >> > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > > org.apache.hadoop.hbase.NotServingRegionException:
> >> > > webpage,http://bbs.city.tianya.cn/tianyacity/Content/178/1/536629.shtml,1258691377544
> >> >
> >> > So figure out what's happening to that region by grepping its name in the
> >> > master log.  Why is it offline so long?  Are machines loaded?  Swapping?
> >> >
> >> > Are the crawlers running on same machines as hbase?
> >> >
> >> > What crawler are you using?
> >> >
> >> > Andrew Purtell has written up, in private correspondence, some notes on
> >> > getting a nice balance between the crawl process and hbase such that all
> >> > runs smoothly.  Let me ask him if it's ok to forward them to the list.
> >> >
> >> >
> >> > ....
> >> >
> >> > > 2009-11-20 12:36:25,259 INFO org.apache.hadoop.hbase.master.ServerManager:
> >> > > Processing MSG_REPORT_SPLIT:
> >> > > webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F178\x2F1\x2F536629.shtml,1258691377544:
> >> > > Daughters;
> >> > > webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F178\x2F1\x2F536629.shtml,1258691779496,
> >> > > webpage,http:\x2F\x2Fbbs.city.tianya.cn\x2Ftianyacity\x2FContent\x2F329\x2F1\x2F164370.shtml,1258691779496
> >> > > from ubuntu12,60020,1258687326554;
> >> > >
> >> > Yeah, it's split.  That's normal.  What's not normal is the client not
> >> > finding the daughter split in its new location.  Did the daughters get
> >> > deployed promptly?
> >> >
> >> >
> >> >
> >> > > And a few hours later, some regionservers shut down.
> >> > >
> >> > > I read the mail
> >> > > http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200907.mbox/%3C9b27a8a60907272122y1bfa6254n95948942d5ca7f88@mail.gmail.com%3E
> >> > > which was sent by my partner Angus. In the mail you told us it was a case
> >> > > of "HBASE-1671", whose Fix Version is 0.20.0, but the hbase version we
> >> > > are using is just 0.20.0.
> >> > >
> >> >
> >> > Can you update to hbase 0.20.2?  It has a bunch of fixes that could be
> >> > related to the above.
> >> > Yours,
> >> > St.Ack
> >> >
> >> >
> >> >
> >> > > Any idea?
> >> > > Best Regards,
> >> > > LvZheng
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > 2009/10/13 stack <stack@duboce.net>
> >> > >
> >> > > > Thanks for posting.  It's much easier reading the logs from there.
> >> > > >
> >> > > > Looking in nohup.out I see it can't find region
> >> > > > 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F080\x2F0\x2F5FOO155J0001124J.html1255072992000_751685,1255316061169'.
> >> > > > It never finds it.  It looks like it was assigned successfully to
> >> > > > 192.168.33.5, going by the master log.  Once you've figured out the
> >> > > > hardware/networking issues, let's work at getting that region back
> >> > > > online.
> >> > > >
> >> > > > The master timed out its session against zk because of 'no route to
> >> > > > host'.
> >> > > >
> >> > > > St.Ack
> >> > > >
> >> > > > On Mon, Oct 12, 2009 at 12:23 AM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> >> > > >
> >> > > > > Hello Stack,
> >> > > > >    I have enabled DEBUG and restarted the test program.  This time the
> >> > > > > master shut down, and I have put the logs on skydrive:
> >> > > > > http://cid-a331bb289a14fbef.skydrive.live.com/browse.aspx/.Public?uc=2&isFromRichUpload=1
> >> > > > >    "nohup.out" is our test program log; "hbase-cyd-master-ubuntu6.log" is
> >> > > > > the master log.
> >> > > > >
> >> > > > >    On the other hand, today we found that when we run "dmesg", there
> >> > > > > were many logs like "[3641697.122769] r8169: eth0: link down". And I
> >> > > > > think this might be the reason for so many "no route to host" and
> >> > > > > "Time Out" errors. Our system manager is checking now; if we have a
> >> > > > > result we will let you know. :)
> >> > > > >    Thanks,
> >> > > > >    LvZheng.
> >> > > > >
> >> > > > > 2009/10/11 stack <stack@duboce.net>
> >> > > > >
> >> > > > > > On Fri, Oct 9, 2009 at 3:18 AM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > ...
> >> > > > > > > so,
> >> > > > > > >    > please remove the delay so hbase fails faster so it doesn't
> >> > > > > > >    > take so long to figure the issue.
> >> > > > > > >    > Are you inserting every 10ms because hbase is falling over on
> >> > > > > > >    > you?  If
> >> > > > > > >    Yes, I inserted every 10ms because I'm afraid hbase would fall
> >> > > > > > > over.  Now I have removed the delay.
> >> > > > > > >
> >> > > > > > >    After doing these, we have run the test program again, and one
> >> > > > > > > region server shut down after about 2 hours, another one after 3.
> >> > > > > > >    I will post the logs on these two servers in following reply
> >> > > > > > > mails.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > Thanks for doing the above.
> >> > > > > >
> >> > > > > > For future debugging, please enable DEBUG and put your logs somewhere
> >> > > > > > where I can pull them, or put them up in pastebin.  Logs in email
> >> > > > > > messages are hard to follow.  Thanks.
> >> > > > > >
> >> > > > > >
> >> > > > > > >    > Ok.  So this is hbase 0.20.0?  Tell us about your hardware.
> >> > > > > > >    > What kind is it?  CPU/RAM/Disks.
> >> > > > > > >    Yes, we are using hbase 0.20.0. And the following is our
> >> > > > > > > hardware:
> >> > > > > > >
> >> > > > > > >    CPU:amd x3 710
> >> > > > > > >    RAM:8g ddr2 800
> >> > > > > > >    Disk:270g(raid0)
> >> > > > > > >
> >> > > > > > >
> >> > > > > > That's an interesting chip -- 3 cores!  The above should be fine as
> >> > > > > > long as you corral your mapreduce jobs running on the same cluster.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > >    We have 7 servers with the above hardware: one for the master,
> >> > > > > > > three for namenodes / regionservers, and the other 3 for zks.
> >> > > > > > >    By the way, what kind of hardware and environment do you suggest
> >> > > > > > > we have?
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > This configuration seems fine to start with.  Later we might
> >> > > > > > experiment with running zk on the same machines as the regionservers,
> >> > > > > > then up the number of regionservers to 6 and up the quorum members to 5.
> >> > > > > >
> >> > > > > > St.Ack
> >> > > > > >
> >> > > > > >
> >> > > > > > >
> >> > > > > > >    Thank you, very much.
> >> > > > > > >    LvZheng.
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>



      