hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: HBase mention in VLDB keynote
Date Tue, 25 Aug 2009 21:08:17 GMT
If you are just looking for numbers, they can vary quite drastically 
depending on the cluster configuration, cluster hardware, jvm/gc 
configuration, dataset properties, read patterns, and load patterns. 
The ones I provided in that presentation are from a very small cluster 
with simple data and low load; they were my attempt at getting some base numbers.

You really need to load up some of your own data and see how it behaves 
on your own cluster.  And tuning is increasingly important now as we are 
limited by Java GC quite a bit.

JG

Schubert Zhang wrote:
> @stack
> We know HIVE-705, and already have good communication with the contributor,
> since we are all Chinese. :-)
> In fact, some code from the patch is used and tested in our project. But we
> need a more flexible data store schema to resolve engineering problems,
> especially performance and practicality.
> 
> @andy
> Do Ryan's results differ from JG's?
> On Wed, Aug 26, 2009 at 2:50 AM, Andrew Purtell <apurtell@apache.org> wrote:
> 
>> Hi Schubert,
>>
>>
>>> Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
>>> contradiction.": can you provide more references, such as a URL/link to
>>> these contradicting results?
>>
>> For JG: http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime
>>
>> I'm sure you have seen this already.
>>
>> Ryan has posted some information on the list now and again.
>>
>> Also I think your work with performance evaluation is very important
>> feedback and data points. Thanks for that.
>>
>>> We are doing an interesting thing: making Hive able to use HBase as its data
>>> store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
>>> and we can also directly query/scan data from HBase.
>>
>> That sounds REALLY interesting!
>>
>>   - Andy
>>
>>
>>
>>
>> ________________________________
>> From: Schubert Zhang <zsongbo@gmail.com>
>> To: hbase-user@hadoop.apache.org
>> Sent: Tuesday, August 25, 2009 8:26:50 PM
>> Subject: Re: HBase mention in VLDB keynote
>>
>> hi andy,
>>
>> Even though the current HBase is not yet ready for production, we know its
>> data model and architecture are really testable and can be evaluated.
>>
>> Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
>> contradiction.": can you provide more references, such as a URL/link to
>> these contradicting results?
>>
>> Regarding Hive, it's a really good design, especially its abstraction of
>> the MapReduce workflow mapped to SQL. Hive has been a success inside
>> Facebook; the report says 29% of Facebook employees use Hive, and 51% of
>> those users are from outside engineering. That is probably because SQL is
>> easier to learn than other languages such as Pig Latin. In fact, Pig is now
>> adding metadata and SQL features, which Hive already provides. But Hive is
>> still not very flexible about using data stores other than HDFS files. We
>> are doing an interesting thing: making Hive able to use HBase as its data
>> store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
>> and we can also directly query/scan data from HBase.
>>
>> I believe HBase can serve as a storage adapter layer above HDFS. It is
>> not a database; it is just a data storage adapter system above HDFS, with
>> a distributed b-tree clustered index. BigTable is designed to provide
>> easier ways to store small data objects with random access, since GFS is
>> designed for sequential-access/batch-processing/large-data storage and is
>> not appropriate for storing small data objects or for random access.
>>
>> I also believe HBase can be a data store that makes MapReduce over HBase
>> possible. If we review the Bigtable paper, especially section 8, we can
>> see that it is widely used for mapreduce analysis/summary in many Google
>> applications.
>>
>>
>> In the recent ACM Queue interview with Sean Quinlan, the Google GFS lead, we
>> can see that Google's new GFS has integrated some of Bigtable's data models.
>> http://queue.acm.org/detail.cfm?id=1594206
>>
>>
>> Schubert
>>
>> On Wed, Aug 26, 2009 at 12:36 AM, Bradford Stephens <
>> bradfordstephens@gmail.com> wrote:
>>
>>> Interesting. I need to see what sort of eval was going on for that
>>> presentation...
>>>
>>> He probably forgot to tweak GC :)
>>>
>>> On Tue, Aug 25, 2009 at 9:32 AM, Andrew Purtell <apurtell@apache.org>
>>> wrote:
>>>
>>>>> Can we write him to figure more on how evaluation was done?
>>>>
>>>> This was one interaction with that group, maybe the only other one
>> aside
>>>> from a question about sizing memstore:
>>>> http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html
>>>> Now I wonder if the eval was done via the REST gateway... A followup
>>> might
>>>> be useful. If I run into someone from Yahoo Research here I'll ask.
>>>> Otherwise we should try mailing them, yes.
>>>>
>>>>> Should we try and get into VLDB next year?
>>>> We can certainly submit a candidate paper given a novel contribution of
>>>> some kind which moves the state of the art forward. There are other
>>> venues
>>>> besides VLDB that we can also consider. Regardless, I think one of us should
>>>> attend VLDB every year.
>>>>
>>>>> Any thing else interesting at the conference?
>>>> Yes.
>>>>
>>>> ETH Zurich presented a system which tailors consistency to the needs of
>>>> various data items -- "consistency rationing in the cloud: pay only
>> when
>>> it
>>>> matters" -- choosing eventual (session) consistency or pessimistic 2PC
>> on
>>>> demand according to a cost model, with good results. Made me think of
>>>> possibilities with THBase. Also, I watched a demo of HIVE, something I
>>>> hadn't seen to date. Their query planner and mapreduce scheduler are
>>>> interesting in concept and in detail. We're looking at Cascading for
>>> batch
>>>> analytics on top of HBase instead, but knowing more about alternatives
>> is
>>>> always good.
>>>>
>>>> The Hadoop-y track is really tomorrow.
>>>>
>>>> Outside of direct relevance to things HBase I attended talks on aspects
>>> of
>>>> data fusion, ETL, and complex event processing / stream processing,
>>> wearing
>>>> my TM hat. Lots of good stuff here.
>>>>
>>>>   - Andy
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Stack <saint.ack@gmail.com>
>>>> To: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
>>>> Sent: Tuesday, August 25, 2009 4:47:57 PM
>>>> Subject: Re: HBase mention in VLDB keynote
>>>>
>>>> The same fella did keynote at apachecon eu on a similar topic.  Then he
>>>> talked mostly of Sherpa/pnuts yahoo tech.   In that presentation we got
>>> no
>>>> mention.  There the comparison strangely was to couchdb and perhaps
>>>> Cassandra (iirc).
>>>>
>>>> So, mention is an improvement (do you think the kick up the behind I
>>>> rendered him after his amsterdam talk could have had anything to do
>> with
>>>> it?).
>>>>
>>>> Can we write him to figure more on how evaluation was done?
>>>>
>>>> Should we try and get into vldb next year?
>>>>
>>>> Good stuff Andy.  Any thing else interesting at the conference?
>>>>
>>>> Stack
>>>>
>>>>
>>>>
>>>> On Aug 25, 2009, at 6:17 AM, Andrew Purtell <apurtell@apache.org>
>> wrote:
>>>>> In this keynote address here at VLDB 2009 (
>>>> http://vldb2009.org/?q=node/22) Raghu Ramakrishnan, Yahoo! Research's
>>>> Chief Scientist, made prominent mention of HBase, much to my surprise
>>> (and
>>>> later chagrin). This happened near the end of the talk when a number of
>>> the
>>>> new elastic/scalable/"nosql" storage systems were discussed to make
>>> concrete
>>>> some of the architectural and data model points made earlier. The
>>>> alternatives considered were Yahoo's PNUTS, sharded MySQL, HBase, and
>>>> Cassandra. I don't know what version of HBase was used exactly but
>>>> unfortunately the message was "not ready yet". Perhaps it was a
>>>> configuration or provisioning issue but HBase did not really survive
>> the
>>>> evaluation, leading to short hyperbolic performance curves terminating
>> on
>>>> the far left of the various graphs. This was quite disappointing to see
>>> as
>>>> the other alternatives were apparently successfully tested on what can
>> be
>>>> presumed to be the same resources. It stands to reason there
>>>>  is opportunity for HBase to improve here if only we know what that is.
>>> It
>>>> was also a little disappointing that it appears through a mailing list
>>>> search that these issues were not brought to either hbase-dev@ or
>>>> hbase-users@, only a minor question relating to the REST interface.
>>>> Perhaps the community could have identified a specific configuration
>>>> problem, recommended a correction for a deployment/provisioning error,
>> or
>>>> resolved a bug. To future evaluators of HBase, on behalf of the
>> community
>>> I
>>>> humbly request that you share your results, good or bad, so we can take
>>> the
>>>> feedback, or the bug reports and their artifacts (logs, etc.) and
>> improve
>>>> our software.
>>>>> At least, the story has already changed from what was presented today
>>> --
>>>> for example, the multimaster architecture of 0.20 was not presented,
>>> rather
>>>> the older one (circa 0.19); and JG's/Ryan's performance test results
>> for
>>>> 0.20 stand as a contradiction. We should look into opportunities to
>>> produce
>>>> a peer reviewed positive contribution. I think we have opportunities to
>>> take
>>>> some novel approaches in the system itself and/or produce a novel
>>> vertical
>>>> contribution and 0.20 is a good substrate for that.
>>>>> Though this was unfortunately a missed opportunity for a good showing
>>> for
>>>> HBase in particular, the keynote in general was a well formulated
>>>> introduction of the emerging area of "cloud scale" storage / "nosql"
>>> systems
>>>> to the largest elite gathering of database and data processing
>>> researchers
>>>> in the world. The presentation was importantly also a call for
>>> participation
>>>> in the future development and directions of the new and growing "nosql"
>>>> constellation. Such participation, whether it is specific involvement
>>> with
>>>> the HBase project or not, would be and is most welcome as the problem
>> of
>>>> serving data at very large scale under "cloud" constraints is an area
>> of
>>>> both significant challenge and significant promise. HBase, like other
>>>> projects in this area, is in an early stage of development. They cover
>>> the
>>>> use cases of their creators but, as answers to the larger set of
>>> problems,
>>>> they are not -- that space is untapped and only waiting for creativity
>>> and
>>>> effort. I
>>>>  think I can speak for HBase in particular, we welcome this and would
>> be
>>>> pleased to assist at every opportunity.
>>>>>    - Andy
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> Media,
>>> and Computer Science
>>>
>>
>>
>>
>>
> 
