hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stack <st...@duboce.net>
Subject Re: Read/Write Performance
Date Wed, 29 Dec 2010 05:55:21 GMT
On Mon, Dec 27, 2010 at 11:47 AM, Wayne <wav100@gmail.com> wrote:
> All data is written to 3 CFs. Basically 2 of the CFs are secondary indexes
> (manually managed as normal CFs). It sounds like we should try hard to get
> as much out of thrift as we can before going to a lower level.


Writes need
> to be "fast enough", but reads are more important in the end (and are the
> reason we are switching from a different solution). The numbers you quoted
> below sound like they are in the ballpark of what we are looking to do.

Even the tens per second that I threw in there to CMA?

> Much of our data is cold, and we expect reads to be disk i/o based.

OK.  FYI, we're not the best at this -- cache-miss cold reads -- what
w/ a network hop in the way and currently we'll open a socket per

> Given
> this is 8GB heap a good place to start on the data nodes (24GB ram)? Is the
> block cache managed on its own (being it won't blow up causing OOM),

It won't.  Its constrained.  Does our home-brewed sizeof.  Default,
its 0.2 of total heap.  If you think cache will help, you could go up
from there.  0.4 or 0.5 of heap.

> and if
> we do not use it (block cache) should we go even lower for the heap (we want
> to avoid CMF and long GC pauses)?

If you are going to be doing cache-miss most of the time and cold
reads, then yes, you can do away with cache.

In testing of 0.90.x I've been running w/ 1MB heaps with 1k regions
but this is my trying to break stuff.

> Are there any timeouts we need to tweak to
> make the cluster more "accepting" of long GC pauses while under sustained
> load (7+ days of 10k/inserts/sec/node)?

If zookeeper client timesout, the regionserver will shut itself down.
In 0.90.0RC2, the client sessionout is set high -- 3 minutes.  If you
timeout that, then thats pretty extreme... something badly wrong I'd
say.  Heres' a few notes on the config and others that you might want
to twiddle (see previous section on required configs... make sure
you've got those too):

> Does LZO compression speed up reads/writes where there is excess CPU to do
> the compression? I assume it would lower disk i/o but increase CPU a lot. Is
> data compressed on the initial write or only after compaction?

LZO is pretty frictionless -- i.e. little CPU cost -- and yes, usually
helps speed things up (grab more in the one go).  What size are your
records?  You might want to mess w/ hfile block sizes though the 64k
default is usually good enough for all but very small cell sizes.

> With the replication in the HDFS layer how are reads managed in terms of
> load balancing across region servers? Does HDFS know to spread multiple
> requests across the 3 region servers that contain the same data?

You only read from one of the replicas, always the 'closest'.  If the
DFSClient has trouble getting the first of the replicas, it moves on
to the second, etc.

> For example
> with 10 data nodes if we have 50 concurrent readers with very "random" key
> requests we would expect to have 5 reads occurring on each data node at the
> same time. We plan to have a thrift server on each data node, so 5
> concurrent readers will be connected to each thrift server at any given time
> (50 in aggregate across 10 nodes). We want to be sure everything is designed
> to evenly spread this load to avoid any possible hot-spots.

This is different.  This is key design.  A thrift server will be doing
some subset of the key space.  If the requests are evenly distributed
over all of the key space, then you should be fine; all thrift servers
will be evenly loaded.  If not, then there could be hot spots.

We have a balancer that currently only counts regions per server, not
regions per server plus hits per region so it could be the case that a
server by chance ends up carrying all of the hot regions.  HBase
itself is not too smart dealing with this.  In 0.90.0, there is
facility for manually moving regions -- i.e. closing in current
location and moving the region off to another server w/ some outage
while the move is happening (usually seconds) -- or you could split
the hot region manually and then the daughters could be moved off to
other servers... Primitive for now but should be better in next HBase

Have you been able to test w/ your data and your query pattern?
That'll tell you way more than I ever could.

Good luck,

> On Mon, Dec 27, 2010 at 1:49 PM, Stack <stack@duboce.net> wrote:
>> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <wav100@gmail.com> wrote:
>> > We are in the process of evaluating hbase in an effort to switch from a
>> > different nosql solution. Performance is of course an important part of
>> our
>> > evaluation. We are a python shop and we are very worried that we can not
>> get
>> > any real performance out of hbase using thrift (and must drop down to
>> java).
>> > We are aware of the various lower level options for bulk insert or java
>> > based inserts with turning off WAL etc. but none of these are available
>> to
>> > us in python so are not part of our evaluation.
>> I can understand python for continuous updates from your frontend or
>> whatever but you might consider hacking up a bit of java to make us of
>> the bulk updater; you'll get upload rates orders of magnitude beyond
>> what you'd achieve going via the API via python (or java for that
>> matter).  You can also do incremental updates using the bulk loader.
>> We have a 10 node cluster
>> > (24gb, 6 x 1TB, 16 core) that we setting up as data/region nodes, and we
>> are
>> > looking for suggestions on configuration as well as benchmarks in terms
>> of
>> > expectations of performance. Below are some specific questions. I realize
>> > there are a million factors that help determine specific performance
>> > numbers, so any examples of performance from running clusters would be
>> great
>> > as examples of what can be done.
>> Yeah, you have been around the block obviously. Its hard to give out
>> 'numbers' since so many different factors involved.
>> Again thrift seems to be our "problem" so
>> > non java based solutions are preferred (do any non java based shops run
>> > large scale hbase clusters?). Our total production cluster size is
>> estimated
>> > to be 50TB.
>> >
>> There are some substantial shops running non-java; e.g. the yfrog
>> folks go via REST, the mozilla fellas are python over thrift,
>> Stumbleupon is php over thrift.
>> > Our data model is 3 CFs, one primary and 2 secondary indexes. All writes
>> go
>> > to all 3 CFs and are grouped as a batch of row mutations which should
>> avoid
>> > row locking issues.
>> >
>> A write updates 3CFs and secondary indices?  Thats an expensive Put
>> relatively.  You have to run w/ 3CFs?  It facilitates fast querying?
>> > What heap size is recommended for master, and for region servers (24gb
>> ram)?
>> Master doesn't take much heap, at least not in the coming 0.90.0 HBase
>> (Is that what you intend to run)?
>> The more RAM you give the regionservers, the more cache your cluster will
>> have.
>> Whats important to you read or write times?
>> > What other settings can/should be tweaked in hbase to optimize
>> performance
>> > (we have looked at the wiki page)?
>> Thats a good place to start.  Take a look through this mailing list
>> for others (Its time for a trawl of mailing list and then distilling
>> the findings into a reedit of our perf page).
>> > What is a good batch size for writes? We will start with 10k
>> values/batch.
>> Start small with defaults.  Make sure its all running smooth first.
>> Then rachet it up.
>> > How many concurrent writers/readers can a single data node handle with
>> > evenly distributed load? Are there settings specific to this?
>> How many clients you going to have writing HBase?
>> > What is "very good" read/write latency for a single put/get in hbase
>> using
>> > thrift?
>> "Very Good" would be < a few milliseconds.
>> > What is "very good" read/write throughput per node in hbase using thrift?
>> >
>> Thousands of ops per second per regionserver (Sorry, can't be more
>> specific than that).  If the Puts are multi-family + updates on
>> secondary indices, hundreds -- maybe even tens... I'm not sure --
>> rather than thousands.
>> > We are looking to get performance numbers in the range of 10k aggregate
>> > inserts/sec/node and read latency < 30ms/read with 3-4 concurrent
>> > readers/node. Can our expectations be met with hbase through thrift? Can
>> > they be met with hbase through java?
>> >
>> I wouldn't fixate on the thrift hop.  At SU we can do thousands of ops
>> a second per node np from PHP frontend over thrift.
>> 10k inserts a second per node into single CF might be doable.  If into
>> 3CFs, then you need to recalibrate your expectations (I'd say).
>> > Thanks in advance for any help, examples, or recommendations that you can
>> > provide!
>> >
>> Sorry, the above is light on recommendations (for reasons cited by
>> Ryan above -- smile).
>> St.Ack

View raw message