chukwa-dev mailing list archives

From Eric Yang <ey...@yahoo-inc.com>
Subject Re: [DISCUSSION] Making HBaseWriter default
Date Tue, 23 Nov 2010 00:22:42 GMT
HBase makes file management on HDFS easier.  HBase rolls the data up into large file
sets, which are more efficient for scanning and random access.  HBase supports MapReduce on tables
instead of on files.  Therefore, data analytics on HBase is a great improvement with no drawback.
The data analytics jobs continue to run every n minutes, but you don't need to wait
5 minutes for data to arrive before starting data processing.

Another limitation this eliminates is the daily and hourly rolling.  Chukwa used to produce
files periodically, and those files needed to be rolled up into bigger files.  Regular append
doesn't work because late-arriving data needs to be re-sorted in the sequence file.  Hence,
we run hourly and daily jobs which do nothing but sort and merge data.  This burns CPU
cycles without much real benefit.
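
To illustrate the late-arrival problem, here is a minimal sketch in plain Python. It is not Chukwa or HBase code: the record tuples and the names `roll_up` and `insert_keyed` are hypothetical, and a sorted list stands in for both the sequence file and the keyed store.

```python
import bisect
import heapq

def roll_up(sorted_file, late_records):
    """Model of an hourly/daily rolling job: rewrite the whole file to
    merge late records back into timestamp order."""
    return list(heapq.merge(sorted_file, sorted(late_records)))

def insert_keyed(store, record):
    """Model of a store that keeps rows ordered by key: a late record is
    placed at its sorted position at write time."""
    bisect.insort(store, record)

# (timestamp, payload) pairs; the record in `late` arrives after the others.
on_time = [(100, "a"), (200, "b"), (300, "c")]
late = [(150, "late")]

appended = on_time + late            # a plain append breaks the ordering
assert appended != sorted(appended)

rolled = roll_up(on_time, late)      # needs a separate merge pass
store = list(on_time)
insert_keyed(store, late[0])         # ordered as soon as it is written
assert rolled == store == sorted(appended)
```

Both paths end with the same ordered data; the difference is whether a separate batch job has to rewrite the file to get there.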

Data looks like this in Chukwa Record:
Time Partition/Primary Key/Actual Timestamp - [small hashmap]

Data looks like this in Hbase:
Timestamp/Primary Key - [big hashmap]

Therefore, the data is effectively identical.  The only differences are that scanning for data is
a lot faster and no CPU cycles are burned sorting/merging data.  HBase handles the merging and
indexing of data much more elegantly.
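
One way to read the two layouts above, as a rough sketch only: the tuple shape, the field names, and the function `to_hbase_rows` are assumptions made for illustration, and plain Python dicts stand in for both Chukwa records and HBase rows. The idea shown is that several small hashmaps sharing a timestamp and primary key fold into one larger hashmap under a single row key.

```python
from collections import defaultdict

def to_hbase_rows(chukwa_records):
    """Fold Chukwa-style records into rows keyed by (timestamp, primary key).

    Each input is (time_partition, primary_key, timestamp, fields), where
    fields is a small dict. Records sharing a timestamp and primary key are
    merged into one larger dict; the time partition is dropped because the
    sorted row key already orders the data by time.
    """
    rows = defaultdict(dict)
    for _partition, primary_key, timestamp, fields in chukwa_records:
        rows[(timestamp, primary_key)].update(fields)
    # Sorting by row key mimics reading the rows back in key order.
    return dict(sorted(rows.items()))

# Hypothetical sample data: two small records for host1, one for host2.
records = [
    ("2010-11-22T13", "host1", 1290431940, {"cpu_user": 12.5}),
    ("2010-11-22T13", "host1", 1290431940, {"mem_used": 2048}),
    ("2010-11-22T13", "host2", 1290431940, {"cpu_user": 3.0}),
]
rows = to_hbase_rows(records)
```

Here `rows` holds two entries: one for host1 carrying both `cpu_user` and `mem_used`, and one for host2.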

We don't need to split data into different partitions because HBase handles this for us.  We
can continue to insert data, and the HBase region servers will partition the data for us and
provide fast scanning.  If the number of records goes beyond trillions, it is still possible
to partition by date in the table name, if the user chooses to do this.
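
If someone did want to partition by date in the table name, the routing could look roughly like this. This is a sketch under assumptions: the `<base>_<YYYYMMDD>` naming scheme, the use of UTC, and the function names are all hypothetical, and no HBase client calls are made.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400

def table_for(base_name, epoch_seconds):
    """Return a per-day table name such as 'metrics_20101122'.

    The '<base>_<YYYYMMDD>' scheme and the use of UTC are assumptions for
    illustration; Chukwa does not define this convention.
    """
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base_name}_{day:%Y%m%d}"

def tables_for_range(base_name, start_seconds, end_seconds):
    """List the daily tables a scan over [start, end] would need to touch."""
    names = []
    # Start at midnight UTC of the first day and step one day at a time.
    t = start_seconds - start_seconds % SECONDS_PER_DAY
    while t <= end_seconds:
        names.append(table_for(base_name, t))
        t += SECONDS_PER_DAY
    return names
```

A writer would call `table_for` on each record's timestamp to pick the table, and a reader would call `tables_for_range` to find which tables a time-bounded scan has to visit.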

Bill, you are reading my mind.  I am also implying that we should deprecate the current hybrid
model and build a cleaner solution that works in the collector.  It would be easier for newcomers
to adopt.

Regards,
Eric

On 11/22/10 1:19 PM, "Bill Graham" <billgraham@gmail.com> wrote:

We are going to continue to have use cases where we want log data
rolled up into 5 minute, hourly and daily increments in HDFS to run
map reduce jobs on them. How will this model work with the HBase
approach? What process will aggregate the HBase data into time
increments like the current demux and hourly/daily rolling processes
do? Basically, what does the time partitioning look like in the HBase
storage scheme?

> My concern is that the demux process is going to become two parallel
> tracks, one works in mapreduce, and another one works in collector.  It
> becomes difficult to have clean efficient parsers which works in both

This statement makes me concerned that you're implying the need to
deprecate the current demux model, which is very different than making
one or the other the default in the configs. Is that the case?



On Mon, Nov 22, 2010 at 11:41 AM, Eric Yang <eyang@yahoo-inc.com> wrote:
> MySQL support has been removed from Chukwa 0.5.  My concern is that the demux process
> is going to become two parallel tracks, one works in mapreduce, and another one works in collector.
> It becomes difficult to have clean efficient parsers which works in both places.  From an architecture
> perspective, incremental updates to data are better than batch processing for near real time
> monitoring purposes.  I would like to ensure the Chukwa framework can deliver Chukwa's mission statement,
> hence I stand by HBase as the default.  I was playing with the HBase 0.20.6 + Pig 0.8 branch last weekend,
> and I was very impressed by both the speed and performance of this combination.  I encourage people
> to try it out.
>
> Regards,
> Eric
>
> On 11/22/10 10:50 AM, "Ariel Rabkin" <asrabkin@gmail.com> wrote:
>
> I agree with Bill and Deshpande that we ought to make clear to users
> that they don't need HICC, and therefore don't need either MySQL or
> HBase.
>
> But I think what Eric meant to ask was which of MySQL and HBase ought
> to be the default *for HICC*.  My sense is that the HBase support
> isn't quite mature enough, but it's getting there.
>
> I think HBase is ultimately the way to go. I think we might benefit as
> a community by doing a 0.5 release first, while waiting for the
> pig-based aggregation support that's blocking HBase.
>
> --Ari
>
> On Mon, Nov 22, 2010 at 10:47 AM, Deshpande, Deepak
> <ddeshpande@verisign.com> wrote:
>> I agree. Making HBase the default would make some Chukwa users' lives difficult. In
>> my setup, I don't need HDFS. I am using Chukwa merely as a log streaming framework. I have
>> plugged in my own writer to write log files to the local file system (instead of HDFS). I evaluated
>> Chukwa against other frameworks, and Chukwa had better fault tolerance built in than the other
>> frameworks. This made me recommend Chukwa over other frameworks.
>>
>> Making HBase the default option would definitely make my life difficult :).
>>
>> Thanks,
>> Deepak Deshpande
>>
>
>
> --
> Ari Rabkin asrabkin@gmail.com
> UC Berkeley Computer Science Department
>
>

