hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: managing 5-10 servers
Date Wed, 24 Nov 2010 18:07:25 GMT
Not just su.pr, but also stumbleupon.com which has the "social" layer.
We do have memcached in front of HBase. Regarding blog posts about our
setup, just search for "stumbleupon hbase" and you'll find tons. The
most recent presentation that's available online is my talk at Hadoop
World.

Vid: http://www.cloudera.com/videos/hw10_video_how_stumbleupon_built_and_advertising_platform_using_hbase_and_hadoop
Slides: http://www.cloudera.com/resource/hw10_stumbleupon_advertising_platform_using_hbase

J-D

On Wed, Nov 24, 2010 at 6:22 AM, S Ahmed <sahmed1020@gmail.com> wrote:
> So you have 20 nodes for the StumbleUpon link redirection service?
>
> Are there any blog posts that go over the setup and what sort of read/write
> traffic it gets?  Is there a memcached layer that sits in front?
>
> On Tue, Nov 23, 2010 at 4:44 PM, Jean-Daniel Cryans <jdcryans@apache.org>wrote:
>
>> I wish I could do a dump of my memory into an ops guide to HBase, but
>> currently I don't think there's such a writeup.
>>
>> What can go wrong... again it depends on your type of usage. With a
>> MR-heavy cluster, it's usually very easy to drive the IO wait through
>> the roof and then you'll end up with GC pauses >60 secs caused by CPU
>> starvation. Here's a recent example we got when a big Mahout job was
>> running:
>>
>> 2010-11-19T18:25:31.173-0800: [GC [ParNew: 114456K->13056K(118016K),
>> 103.8190010 secs] 4624541K->4535473K(7154944K), 104.7165690 secs]
>> [Times: user=4.45 sys=2.02, real=104.72 secs]
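What the trained eye sees in that line can be checked mechanically: the wall-clock pause (real=104.72s) dwarfs the CPU time actually spent (user+sys ≈ 6.5s), which means the JVM was descheduled, i.e. swapping or starved for CPU. A small sketch of that check (the regex and the 10x threshold are assumptions, not standard JVM tooling):

```python
import re

# Pull user/sys/real out of a ParNew GC log line and flag pauses where
# wall-clock time dwarfs CPU time -- the signature of swapping or CPU
# starvation rather than a genuinely expensive collection.
GC_TIMES = re.compile(
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+),"
    r" real=(?P<real>[\d.]+) secs\]"
)

def pause_looks_starved(log_line, ratio=10.0):
    """Return True when real time exceeds total CPU time by `ratio`."""
    m = GC_TIMES.search(log_line)
    if m is None:
        return False
    user, sys_, real = (float(m.group(g)) for g in ("user", "sys", "real"))
    return real > ratio * (user + sys_)

line = ("2010-11-19T18:25:31.173-0800: [GC [ParNew: 114456K->13056K(118016K), "
        "103.8190010 secs] 4624541K->4535473K(7154944K), 104.7165690 secs] "
        "[Times: user=4.45 sys=2.02, real=104.72 secs]")
print(pause_looks_starved(line))  # True: 104.72s wall vs ~6.5s of CPU
```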
>>
>> The trained eye will quickly see that something very bad happened on
>> that cluster. Indeed, during post-mortem we saw that somehow that
>> machine started swapping which is the Worst Thing Ever (tm) that can
>> happen to a machine that runs Java processes. Make sure that your
>> memory usage always stays under your total memory, even when all the
>> mappers and reducers are using their heap to the fullest. And then
>> double check that (which it seems we didn't do).
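That worst-case budgeting can be done as back-of-the-envelope arithmetic: add up the maximum heap of every JVM the slave node can run at once and leave headroom for the OS. A sketch with purely illustrative numbers (the heap sizes and slot counts below are assumptions, not figures from this thread):

```python
# Sum the worst-case heap of every JVM on a slave node and compare it to
# physical RAM, keeping headroom for the OS and its page cache. If this
# ever comes out False, a big MR job can push the box into swap.
def memory_budget_ok(total_ram_mb, daemons_mb, map_slots, map_heap_mb,
                     reduce_slots, reduce_heap_mb, os_headroom_mb=1024):
    worst_case = (sum(daemons_mb.values())
                  + map_slots * map_heap_mb
                  + reduce_slots * reduce_heap_mb)
    return worst_case + os_headroom_mb <= total_ram_mb

# Hypothetical 16 GB slave running a region server, datanode and tasktracker.
daemons = {"regionserver": 4096, "datanode": 1024, "tasktracker": 1024}
print(memory_budget_ok(16384, daemons, map_slots=6, map_heap_mb=1024,
                       reduce_slots=2, reduce_heap_mb=1024))  # True here
```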
>>
>> On a cluster that serves web traffic, and thus must not be MRed
>> against, you get the "usual" stuff like bad disks and operator errors.
>>
>> J-D
>>
>> On Tue, Nov 23, 2010 at 1:31 PM, S Ahmed <sahmed1020@gmail.com> wrote:
>> > Are there any writeups on what things to look for?
>> >
>> > What are some of the things that usually go wrong? Or is that an unfair
>> > question :)
>> >
>> > On Tue, Nov 23, 2010 at 4:22 PM, Jean-Daniel Cryans <jdcryans@apache.org
>> >wrote:
>> >
>> >> Constant hand holding no, constant monitoring yes. Do setup Ganglia
>> >> and preferably Nagios. Then it depends what you're planning to do with
>> >> your cluster... here we have 2x 20 machines in production, the one
>> >> that serves live traffic is pretty much doing its own thing by itself
>> >> (although I keep a ganglia tab opened on a second monitor) and the
>> >> other one is used strictly for MapReduce, on which our internal
>> >> users have developed a habit of running very destructive jobs. But
>> >> to be fair, it's probably the users that need support the most ;)
>> >>
>> >> J-D
>> >>
>> >> On Tue, Nov 23, 2010 at 1:14 PM, S Ahmed <sahmed1020@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > How much of a guru do you have to be to keep say 5-10 servers humming?
>> >> >
>> >> > I'm a 1-man shop, and I dream of developing a web application, and
>> >> scaling
>> >> > will be a core part of the application.
>> >> >
>> >> > Is it feasible for a 1-man operation to manage a 5-10 server hbase
>> >> cluster?
>> >> > Is it something that requires hand holding and constant monitoring,
>> >> > or does it tend to be hands off?
>> >> >
>> >>
>> >
>>
>
