hbase-user mailing list archives

From Eric Yang <eric...@gmail.com>
Subject Re: Any successful story of an HBase cell for 'analytics job' plus 'realtime serving'?
Date Sun, 04 Jul 2010 05:23:18 GMT
My project is currently in the prototyping stage.  The machines are
running in VMware on a 2006-era Core Duo iMac: 2 virtual machines plus
the host itself, and the insertion rate into HBase is around
300 kbytes/s per node.  The low number is due to the limitations of
virtualized disk I/O.  HBase's raw insertion rate is a lot faster, but
my code is not fully optimized and my data generation is kept at a
minimum (300 kbytes/s).  The Chukwa collector's raw throughput is about
10-15 mbytes/s per node when writing sequence files on a 2 GHz Xeon
server built around 2007.  I think with some optimization I can
probably get to about half of the sequence-file write performance.
When writing sequence files, the data is written as raw bytes without
parsing.  The Chukwa HBase writer uses the same demux parser that we
run in our MapReduce job.  There is some fixed cost associated with the
parsing and filtering, but having the data available for viewing within
50-100 ms makes the HBase implementation a clear winner.  Without
HBase, the data had to wait 1 minute to be deposited to HDFS (file
close on HDFS), up to 5 minutes (worst case) for the demux MapReduce
job to kick in (waiting for enough data to make MapReduce worthwhile),
and 2 minutes for the MapReduce job to finish.  By the time the data
was available, the latency was 5-10 minutes.
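
To make the write path concrete, the general shape of what the HBase
writer does per parsed record is something like the sketch below (a
rough sketch against the 0.20-era Java client; the table name, column
family, and row-key layout are made up for illustration and are not the
actual Chukwa schema):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RecordWriteSketch {
    public static void main(String[] args) throws Exception {
      // Hypothetical table name, for illustration only.
      HTable table = new HTable(new HBaseConfiguration(), "chukwa_records");
      long ts = System.currentTimeMillis();
      // Row key groups records by source and minute, so HBase does the
      // time indexing for us.
      Put put = new Put(Bytes.toBytes("hostA/" + (ts / 60000L)));
      // One cell per parsed field, stamped with the record time.
      put.add(Bytes.toBytes("metrics"), Bytes.toBytes("bytes_in"), ts,
              Bytes.toBytes("307200"));
      table.put(put);  // autoflush is on by default, so the cell is
                       // readable as soon as this returns
    }
  }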

For vertical aggregation (analytics that involve a lot of rows), the
latency used to be 5-10 minutes for raw data availability plus the
MapReduce job duration.  Now the MapReduce job can run at a fixed
interval, and the data is available after just the job duration.

In the current implementation of HBase table MapReduce, the input table
cannot be the same as the output table, so the job is unlikely to
impact query performance even if the user queries the same table as the
input table.  Insertion from the collector and queries from users can
happen in parallel without impacting each other's performance much.  I
imagine it should also be fine to have users query the output table
while the job is writing to it.
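
As a rough illustration (not actual Chukwa code; the table names,
column family, and row-key layout are invented for this sketch), a
rollup job wired up with TableMapReduceUtil that reads one table and
writes its aggregates to a different table looks something like this:

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.mapreduce.TableReducer;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;

  public class HourlyRollup {
    // Mapper: scans the raw table and emits (hour bucket, 1) per row.
    static class RollupMapper extends TableMapper<Text, LongWritable> {
      protected void map(ImmutableBytesWritable row, Result value, Context ctx)
          throws IOException, InterruptedException {
        // Assumes the row key starts with a "yyyyMMddHH" hour prefix.
        String hour = Bytes.toString(row.get()).substring(0, 10);
        ctx.write(new Text(hour), new LongWritable(1));
      }
    }

    // Reducer: sums the counts and writes one Put per hour to the
    // summary table.
    static class RollupReducer
        extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
      protected void reduce(Text hour, Iterable<LongWritable> values, Context ctx)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) sum += v.get();
        Put put = new Put(Bytes.toBytes(hour.toString()));
        put.add(Bytes.toBytes("summary"), Bytes.toBytes("hits"),
                Bytes.toBytes(sum));
        ctx.write(null, put);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new HBaseConfiguration(), "hourly-rollup");
      job.setJarByClass(HourlyRollup.class);
      // Input table and output table are different, as described above.
      TableMapReduceUtil.initTableMapperJob("raw_metrics", new Scan(),
          RollupMapper.class, Text.class, LongWritable.class, job);
      TableMapReduceUtil.initTableReducerJob("hourly_summary",
          RollupReducer.class, job);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Running a job like that at a fixed interval is what gives the "data
available after the MapReduce job duration" behavior described above.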

Your use case is definitely possible with HBase.  It is aligned with
my current research on aggregation trends.  Keep in mind that full data
availability is bounded by the MapReduce job length, so it is not real
time.

regards,
Eric

On Sat, Jul 3, 2010 at 5:37 PM, Sean Bigdatafun
<sean.bigdatafun@gmail.com> wrote:
> Hi Eric,
>
> Thanks for sharing the application. I have two questions about your
> scenario:
>
> 1) It looks like you tapped Chukwa's monitoring logs directly into an HBase
> table. How big is your HBase cell (how many servers), and what is the
> throughput of your incoming log stream?
> 2) It looks like you have not done the MapReduce part yet, though you have
> read the javadoc, right? If so, have you considered the case where a heavy
> MapReduce analytics job pegs the HBase cell so hard that its query serving
> degrades and the end-user experience becomes unacceptable (i.e., the query
> latency becomes very high because of the data crunching)?
>
>
> What I am thinking of is the following scenario:
> -- 1) I want to store my hourly web traffic into a fact table, Table A
> -- 2) I want to invoke MapReduce to generate an aggregated table (like
> trends/web-usage-summary) into Table B
> -- 3) I want to serve end users' queries from Table B.
>
> Thanks,
> Sean
> On Sat, Jul 3, 2010 at 4:53 PM, Eric Yang <eric818@gmail.com> wrote:
>
>> Hi Sean,
>>
>> I am writing an interface for Chukwa to inject data directly into
>> HBase and rely on HBase to index my data by time group/row key.  It
>> is working fine for me.  I can tap into the realtime data sink table
>> to monitor data arrival and create simple visualizations.  The only
>> minor problem is that by default the cell returns only the most
>> recent three versions to me instead of the 60 versions that I put
>> into the system.  I am sure it's something simple that I missed.
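>>
>> (Presumably it is just max versions: a column family only keeps three
>> versions by default, so something along the lines of the sketch
>> below, with names invented for illustration, should cover it.)
>>
>>   // At table-creation time: keep 60 versions instead of the default 3.
>>   HColumnDescriptor family = new HColumnDescriptor("metrics");
>>   family.setMaxVersions(60);
>>   // At read time: a Get only returns one version unless asked.
>>   Get get = new Get(Bytes.toBytes("some-row-key"));
>>   get.setMaxVersions(60);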
>>
>> The next step is to use TableInput and TableOutput for MapReduce to
>> run analytic computations for my large time-series trends.  From what
>> I gather from the HBase javadoc, it looks very promising and simple
>> to implement.  With HBase managing the file structures, indexing, and
>> roll-up of files, it brings Chukwa one step closer to becoming a
>> real-time monitoring and reporting application for Hadoop.  Being a
>> silent observer of HBase, I have waited 2 years for BigTable-like
>> storage for the Hadoop ecosystem, and HBase is the closest to
>> achieving this goal.
>>
>> Running a MapReduce job on HBase is unlikely to give you a real-time
>> system, since there are a lot of bytes transferred between MapReduce
>> and HBase.  However, if you only need a near-real-time experience,
>> like running the MapReduce job every 5-30 minutes, then it is
>> certainly in the realm of possibility.
>>
>> regards,
>> Eric
>>
>> On Sat, Jul 3, 2010 at 2:42 PM, Sean Bigdatafun
>> <sean.bigdatafun@gmail.com> wrote:
>> > I read the thread "Use cases of Hbase" in the March archive, and several
>> > people seemed to suggest that an HBase cell can be used as a mixed cell
>> > for data crunching and online serving (i.e., using the Hive HBase client
>> > to do the analytics part while serving live queries; see
>> > http://osdir.com/ml/hbase-user-hadoop-apache/2010-03/msg00299.html). Did
>> > anyone really have such a success story? I am a little doubtful about
>> > that idea.
>> >
>> > Someone else also implied such a use case: "Since 0.20.0, results of
>> > analytic computations over the data can be materialized and served out in
>> > real time in response to queries. This is a complete solution."
>> >
>> > Can someone share their experience with such a setup?
>> >
>> > Thanks,
>> > Sean
>> >
>>
>
