hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Use cases of HBase
Date Wed, 10 Mar 2010 09:36:07 GMT
> What's the exact meaning of "materialized" here? Would you
> kindly give more details?

Basically what I am saying is that the analytic computation can produce a table of answers
to questions which may be asked at some future time. Since HBase 0.20.0, random access to
table data is of low enough latency to host that information directly. So, typically a
user-written batch process will run using TableInputFormat over raw data and write cooked
results via TableOutputFormat into a table for answering queries later in real time. Depending
on the use case this is usually called either precomputation or materialization. Precomputation
is the generic term. Materialization (as in "materialized views") I believe was coined by Oracle.
These terms are used interchangeably to refer to the process of computing answers to a set of
possible queries in advance. To be pedantic I should have said precomputation instead of
materialization, because the latter implies occasional automatic update of the cached data by
the database engine. Of course HBase does not do that.
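To make the idea concrete, here is a toy sketch (hypothetical class and names; plain in-memory maps stand in for the HBase tables that a real TableInputFormat/TableOutputFormat MapReduce job would scan and write):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of precomputation: a batch pass over "raw" events builds a
// table of cooked answers; later queries are plain lookups against it.
// In a real HBase 0.20 job the batch pass would be a MapReduce job reading
// via TableInputFormat and writing via TableOutputFormat; the Map below
// merely stands in for the results table.
public class Precompute {
    // Batch phase: count events per key -- the "answers" computed in advance.
    public static Map<String, Integer> materialize(List<String> rawEvents) {
        Map<String, Integer> cooked = new HashMap<>();
        for (String event : rawEvents) {
            cooked.merge(event, 1, Integer::sum);
        }
        return cooked;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("page:home", "page:home", "page:sports");
        Map<String, Integer> answers = materialize(raw);
        // Query phase: real-time lookup, no rescan of the raw data.
        System.out.println(answers.get("page:home"));   // prints 2
        System.out.println(answers.get("page:sports")); // prints 1
    }
}
```

The only point of the sketch is the split: all the scanning cost is paid once in the batch phase, and serving a query is a constant-time read, which is what HBase 0.20's random-read latency makes practical to host directly in a table.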

Hope that helped,

   - Andy



----- Original Message ----
> From: Hua Su <huas.su@gmail.com>
> To: hbase-user@hadoop.apache.org
> Sent: Wed, March 10, 2010 1:01:33 AM
> Subject: Re: Use cases of HBase
> 
> Hi Purtell,
> 
> What do you mean by "Since 0.20.0, results of analytic computations over the
> data can be materialized and served out in real time in response to
> queries."? What's the exact meaning of "materialized" here? Would you
> kindly give more details?
> 
> Thanks!
> 
> - Hua
> 
> On Wed, Mar 10, 2010 at 8:12 AM, Andrew Purtell wrote:
> 
> > I came to this discussion late.
> >
> > Ryan and J-D's use case is clearly successful.
> >
> > In addition to what others have said, I think another case where HBase
> > really excels is supporting analytics over Big Data (which I define as on
> > the order of petabytes). Some of the best performance numbers are put up by
> > scanners. There is tight integration with the Hadoop MapReduce framework,
> > not only in terms of API support but also with respect to efficient task
> > distribution over the cluster -- moving computation to data -- and there is
> > a favorable interaction with HDFS's location-aware data placement. Moving
> > computation to data like that is one major reason why analytics using the
> > MapReduce paradigm can put conventional RDBMSs/data warehouses to shame at
> > substantially lower cost. Since 0.20.0, results of analytic computations over
> > the data can be materialized and served out in real time in response to
> > queries. This is a complete solution.
> >
> 
> 
> 
> 
> >
> >   - Andy
> >
> >
> >
> > ----- Original Message ----
> > > From: Ryan Rawson 
> > > To: hbase-user@hadoop.apache.org
> > > Sent: Tue, March 9, 2010 3:34:55 PM
> > > Subject: Re: Use cases of HBase
> > >
> > > HBase operates more like a write-thru cache.  Recent writes are in
> > > memory (aka memstore).  Older data is in the block cache (by default
> > > 20% of Xmx).  While you can rely on os buffering, you also want a
> > > generous helping of block caching directly in HBase's regionserver.
> > > We are seeing great performance, and our 95th percentiles seem to be
> > > related to GC pauses.
> > >
> > > So to answer your use case below, the answer is most decidedly 'yes':
> > > recent values are in memory and are read back from memory as well.
> > >
> > > -ryan
> > >
> > > On Tue, Mar 9, 2010 at 3:12 PM, Charles Woerner wrote:
> > > > Ryan, your confidence has me interested in exploring HBase a bit
> > > > further for some real-time functionality that we're building out.  One
> > > > question about the mem-caching functionality in HBase...  Is it
> > > > write-through or write-back such that all frequently written items are
> > > > likely in memory, or is it pull-through via a client query?  Or would I
> > > > be relying on lower-level caching features of the OS and underlying
> > > > filesystem?  In other words, where there are a high number of both
> > > > reads and writes, and where 90% of all the reads are on recently (last
> > > > 5 minutes) written data, would the HBase architecture help ensure that
> > > > the most recently written data is already in the cache?
> > > >
> > > > On Tue, Mar 9, 2010 at 2:29 PM, Ryan Rawson wrote:
> > > >
> > > >> One thing to note is that 10GB is half the memory of a reasonably
> > > >> sized machine. In fact I have seen 128 GB memcache boxes out there.
> > > >>
> > > >> As for performance, I obviously feel HBase can be performant for real
> > > >> time queries.  To get a consistent response you absolutely have to
> > > >> have 95%+ caching in ram. There is no way to achieve 1-2ms responses
> > > >> from disk. Throwing enough ram at the problem, I think HBase solves
> > > >> this nicely and you won't have to maintain multiple architectures.
> > > >>
> > > >> -ryan
> > > >>
> > > >> On Tue, Mar 9, 2010 at 2:08 PM, Jonathan Gray wrote:
> > > >> > Brian,
> > > >> >
> > > >> > I would just reiterate what others have said.  If your goal is a
> > > >> > consistent 1-2ms read latency and your dataset is on the order of
> > > >> > 10GB... HBase is not a good match.  It's more than what you need
> > > >> > and you'll take unnecessary performance hits.
> > > >> >
> > > >> > I would look at some of the simpler KV-style stores out there like
> > > >> > Tokyo Cabinet, Memcached, or BerkeleyDB, or in-memory ones like
> > > >> > Redis.
> > > >> >
> > > >> > JG
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: jaxzin [mailto:Brian.R.Jackson@espn3.com]
> > > >> > Sent: Tuesday, March 09, 2010 12:09 PM
> > > >> > To: hbase-user@hadoop.apache.org
> > > >> > Subject: Re: Use cases of HBase
> > > >> >
> > > >> >
Gary, I looked at your presentation and it was very helpful.  But I do
> > > >> > have a few unanswered questions from it if you wouldn't mind
> > > >> > answering them.  How big is/was your cluster that handled 3k
> > > >> > req/sec?  And what were the specs on each node (RAM/CPU)?
> > > >> >
> > > >> > When you say latency can be good, what do you mean?  Is it even in
> > > >> > the ballpark of 1 ms?  Because we already deal with the GC and
> > > >> > don't expect perfect real-time behavior.  So that might be okay
> > > >> > with me.
> > > >> >
> > > >> > P.S. I was at Hadoop World NYC and saw Ryan and Jonathan's
> > > >> > presentation there but somehow mentally blocked it.  Thanks for the
> > > >> > reminder.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Gary Helmling wrote:
> > > >> >>
> > > >> >> Hey Brian,
> > > >> >>
> > > >> >> We use HBase to complement MySQL in serving activity-stream type
> > > >> >> data here at Meetup.  It's handling real-time requests involved in
> > > >> >> 20-25% of our page views, but our latency requirements aren't as
> > > >> >> strict as yours.  For what it's worth, I did a presentation on our
> > > >> >> setup which will hopefully fill in some details:
> > > >> >> http://www.slideshare.net/ghelmling/hbase-at-meetup
> > > >> >>
> > > >> >> There are also some great presentations by Ryan Rawson and
> > > >> >> Jonathan Gray on how they've used HBase for realtime serving on
> > > >> >> their sites.  See the presentations wiki page:
> > > >> >> http://wiki.apache.org/hadoop/HBase/HBasePresentations
> > > >> >>
> > > >> >> Like Barney, I suspect where you'll hit some issues will be in
> > > >> >> your latency requirements.  Depending on how you lay out your data
> > > >> >> and configure your column families, your average latency may be
> > > >> >> good, but you will hit some pauses, as I believe reads block at
> > > >> >> times during region splits or compactions and memstore flushes
> > > >> >> (unless you have a fairly static data set).  Others here should be
> > > >> >> able to fill in more details.
> > > >> >>
> > > >> >> With a relatively small dataset, you may want to look at the "in
> > > >> >> memory" configuration option for your column families.
> > > >> >>
> > > >> >> What's your expected workload -- writes vs. reads?  Types of reads
> > > >> >> you'll be doing: random access vs. sequential?  There are a lot of
> > > >> >> knowledgeable folks here to offer advice if you can give us some
> > > >> >> more insight into what you're trying to build.
> > > >> >>
> > > >> >> --gh
> > > >> >>
> > > >> >>
> > > >> >> On Tue, Mar 9, 2010 at 11:21 AM, jaxzin wrote:
> > > >> >>
> > > >> >>>
> > > >> >>> This is exactly the kind of feedback I'm looking for, thanks,
> > > >> >>> Barney.
> > > >> >>>
> > > >> >>> So it sounds like you cache the data you get from HBase in
> > > >> >>> session-based memory?  Are you using a Java EE HttpSession?  (I'm
> > > >> >>> less familiar with the django/rails equivalent but I'm assuming
> > > >> >>> they exist.)  Or are you using a memory cache provider like
> > > >> >>> ehcache or memcache(d)?
> > > >> >>>
> > > >> >>> Can you tell me more about your experience with latency and why
> > > >> >>> you say that?
> > > >> >>>
> > > >> >>>
> > > >> >>> Barney Frank wrote:
> > > >> >>> >
> > > >> >>> > I am using HBase to store visitor-level clickstream-like data.
> > > >> >>> > At the beginning of the visitor session I retrieve all the
> > > >> >>> > previous session data from HBase, use it within my app server,
> > > >> >>> > massage it a little, and serve it to the consumer via web
> > > >> >>> > services.  Where I think you will run into the most problems is
> > > >> >>> > your latency requirement.
> > > >> >>> >
> > > >> >>> > Just my 2 cents from a user.
> > > >> >>> >
> > > >> >>> > On Tue, Mar 9, 2010 at 9:45 AM, jaxzin wrote:
> > > >> >>> >
> > > >> >>> >>
> > > >> >>> >> Hi all, I've got a question about how everyone is using
> > > >> >>> >> HBase.  Is anyone using it as an online data store to directly
> > > >> >>> >> back a web service?
> > > >> >>> >>
> > > >> >>> >> The text-book example of a weblink HBase table suggests there
> > > >> >>> >> would be an associated web front-end to display the
> > > >> >>> >> information in that HBase table (ex. search results page), but
> > > >> >>> >> I'm having trouble finding evidence that anyone is servicing
> > > >> >>> >> web traffic backed directly by an HBase instance in practice.
> > > >> >>> >>
> > > >> >>> >> I'm evaluating whether HBase would be the right tool to
> > > >> >>> >> provide a few things for a large-scale web service we want to
> > > >> >>> >> develop at ESPN, and I'd really like to get opinions and
> > > >> >>> >> experience from people who have already been down this path.
> > > >> >>> >> No need to reinvent the wheel, right?
> > > >> >>> >>
> > > >> >>> >> I can tell you a little about the project goals if it helps
> > > >> >>> >> give you an idea of what I'm trying to design for:
> > > >> >>> >>
> > > >> >>> >> 1) Highly available (it would be a central service and an
> > > >> >>> >> outage would take down everything)
> > > >> >>> >> 2) Low latency (1-2 ms, less is better, more isn't acceptable)
> > > >> >>> >> 3) High throughput (5-10k req/sec at worst-case peak)
> > > >> >>> >> 4) Unstable traffic (ex. Sunday afternoons during football
> > > >> >>> >> season)
> > > >> >>> >> 5) Small data...for now (< 10 GB of total data currently, but
> > > >> >>> >> HBase could allow us to design differently and store more
> > > >> >>> >> online)
> > > >> >>> >>
> > > >> >>> >> The reason I'm looking at HBase is that we've solved many of
> > > >> >>> >> our scaling issues with the same basic concepts as HBase
> > > >> >>> >> (sharding, flattening data to fit in one row, throwing away
> > > >> >>> >> ACID, etc.) but with home-grown software.  I'd like to adopt
> > > >> >>> >> an active open-source project if it makes sense.
> > > >> >>> >>
> > > >> >>> >> Alternatives I'm also looking at: RDBMS fronted with WebSphere
> > > >> >>> >> eXtreme Scale, RDBMS fronted with Hibernate/ehcache, or (the
> > > >> >>> >> option I understand the least right now) memcached.
> > > >> >>> >>
> > > >> >>> >> Thanks,
> > > >> >>> >> Brian
> > > >> >>> >> --
> > > >> >>> >> View this message in context:
> > > >> >>> >> http://old.nabble.com/Use-cases-of-HBase-tp27837470p27837470.html
> > > >> >>> >> Sent from the HBase User mailing list archive at Nabble.com.
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >
> > > >> >>> >
> > > >> >>>
> > > >> >>>
> > > >> >>>
> > > >> >>
> > > >> >>
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > ---
> > > > Thanks,
> > > >
> > > > Charles Woerner
> > > >
> >
> >
> >
> >
> >
> >



      

