hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)
Date Sat, 18 Feb 2012 01:58:35 GMT
As your numbers show.

Dataset
Size 	Hypertable
Queries/s 	HBase
Queries/s 	Hypertable
Latency (ms) 	HBase
Latency (ms)
0.5 TB 	3256.42 	2969.52 	157.221 	172.351
5 TB 	2450.01 	2066.52 	208.972 	247.680

Raw data goes up. Read performance goes down. Latency goes up.

You mentioned you loaded 1/2 trillion records of historical financial
data. The operative word is historical. Your not doing 1/2 trillion
writes every day.

Most of the system that use structured log formats can write very fast
(I am guessing that is what hypertable uses btw). DD writes very fast
as well, but if you want acceptable read latency you are going to need
a good RAM/disk ratio.

Even at 0.5 TB 157.221ms is not a great read latency, so your ability
to write fast has already outstripped your ability to read at a rate
that could support say web application. (I come from a world of 1-5ms
latency BTW).

What application can you support with numbers like that? An email
compliance system where you want to store a ton of data, but only plan
of doing 1 search a day to make an auditor happy? :) This is why I say
your going to end up needing about the same # of nodes because when it
comes time to read this data having a machine with 4Tb of data and 24
GB ram is not going to cut it.

You are right on a couple of fronts
1) being able to load data fast is good (can't argue with that)
2) If hbase can't load X entries that is bad

I really can't imagine that hbase blows up and just stops accepting
inserts at one point. You seem to say its happening and I don't have
time to verify. But if you are at the point where you are getting
175ms random and 85 zipfan latency what are you proving that is
already more data then a server can handle.

http://en.wikipedia.org/wiki/Network_performance

Users browsing the Internet feel that responses are "instant" when
delays are less than 100 ms from click to response[11]. Latency and
throughput together affect the perceived speed of a connection.
However, the perceived performance of a connection can still vary
widely, depending in part on the type of information transmitted and
how it is used.



On Fri, Feb 17, 2012 at 7:25 PM, Doug Judd <doug@hypertable.com> wrote:
> Hi Edward,
>
> The problem is that even if the workload is 5% write and 95% read, if you
> can't load the data, you need more machines.  In the 167 billion insert
> test, HBase failed with *Concurrent mode failure* after 20% of the data was
> loaded.  One of our customers has loaded 1/2 trillion records of historical
> financial market data on 16 machines.  If you do the back-of-the-envelope
> calculation, it would take about 180 machines for HBase to load 1/2
> trillion cells.  That makes HBase 10X more expensive in terms of hardware,
> power consumption, and data center real estate.
>
> - Doug
>
> On Fri, Feb 17, 2012 at 3:58 PM, Edward Capriolo <edlinuxguru@gmail.com>wrote:
>
>> I would almost agree with prospective. But their is a problem with 'java is
>> slow' theory. The reason is that in a 100 percent write workload gc might
>> be a factor.
>>
>> But in the real world people have to read data and read becomes disk bound
>> as your data gets larger then memory.
>>
>> Unless C++ can make your disk spin faster then java It is a wash. Making a
>> claim that your going to need more servers for java/hbase is bogus. To put
>> it in prospective, if the workload is 5 % write and 95 % read you are
>> probably going to need just the same amount of hardware.
>>
>> You might get some win on the read size because your custom caching could
>> be more efficient in terms of object size in memory and other gc issues but
>> it is not 2 or 3 to one.
>>
>> If a million writes fall into a hypertable forest but it take a billion
>> years to read them back did the writes ever sync :)
>>
>>
>> On Monday, February 13, 2012, Doug Judd <doug@hypertable.com> wrote:
>> > Hey Todd,
>> >
>> > Bulk loading isn't always an option when data is streaming in from a live
>> > application.  Many big data use cases involve massive amounts of smaller
>> > items in the size range of 10-100 bytes, for example URLs, sensor
>> readings,
>> > genome sequence reads, network traffic logs, etc.  If HBase requires 2-3
>> > times the amount of hardware to avoid *Concurrent mode failures*, then
>> that
>> > makes HBase 2-3 times more expensive from the standpoint of hardware,
>> power
>> > consumption, and datacenter real estate.
>> >
>> > What takes the most time is getting the core database mechanics right
>> > (we're going on 5 years now).  Once the core database is stable,
>> > integration with applications such as Solr and others are short term
>> > projects.  I believe that sooner or later, most engineers working in this
>> > space will come to the conclusion that Java is the wrong language for
>> this
>> > kind of database application.  At that point, folks on the HBase project
>> > will realize that they are five years behind.
>> >
>> > - Doug
>> >
>> > On Mon, Feb 13, 2012 at 11:33 AM, Todd Lipcon <todd@cloudera.com> wrote:
>> >
>> >> Hey Doug,
>> >>
>> >> Want to also run a comparison test with inter-cluster replication
>> >> turned on? How about kerberos-based security on secure HDFS? How about
>> >> ACLs or other table permissions even without strong authentication?
>> >> Can you run a test comparing performance running on top of Hadoop
>> >> 0.23? How about running other ecosystem products like Solbase,
>> >> Havrobase, and Lily, or commercial products like Digital Reasoning's
>> >> Synthesys, etc?
>> >>
>> >> For those unfamiliar, the answer to all of the above is that those
>> >> comparisons can't be run because Hypertable is years behind HBase in
>> >> terms of features, adoption, etc. They've found a set of benchmarks
>> >> they win at, but bulk loading either database through the "put" API is
>> >> the wrong way to go about it anyway. Anyone loading 5T of data like
>> >> this would use the bulk load APIs which are one to two orders of
>> >> magnitude more efficient. Just ask the Yahoo crawl cache team, who has
>> >> ~1PB stored in HBase, or Facebook, or eBay, or many others who store
>> >> hundreds to thousands of TBs in HBase today.
>> >>
>> >> Thanks,
>> >> -Todd
>> >>
>> >> On Mon, Feb 13, 2012 at 9:07 AM, Doug Judd <doug@hypertable.com> wrote:
>> >> > In our original test, we mistakenly ran the HBase test with
>> >> > the hbase.hregion.memstore.mslab.enabled property set to false.  We
>> >> re-ran
>> >> > the test with the hbase.hregion.memstore.mslab.enabled property set
to
>> >> true
>> >> > and have reported the results in the following addendum:
>> >> >
>> >> > Addendum to Hypertable vs. HBase Performance
>> >> > Test<
>> >>
>> http://www.hypertable.com/why_hypertable/hypertable_vs_hbase_2/addendum/>
>> >> >
>> >> > Synopsis: It slowed performance on the 10KB and 1KB tests and still
>> >> failed
>> >> > the 100 byte and 10 byte tests with *Concurrent mode failure*
>> >> >
>> >> > - Doug
>> >>
>> >>
>> >>
>> >> --
>> >> Todd Lipcon
>> >> Software Engineer, Cloudera
>> >>
>> >
>>

Mime
View raw message