chukwa-dev mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: [DISCUSSION] Making HBaseWriter default
Date Tue, 23 Nov 2010 06:38:25 GMT
I see plenty of value in the HBase approach, but I'm still not clear
on how the time and data type partitioning would be done more
efficiently within HBase when running a job on a specific 5-minute
interval for a given data type. I've only used HBase briefly, so I
could certainly be missing something, but I thought range scans sort
rows by byte order, which works for string types but not for numbers.

So if your row ids are <timestamp>/<data_type>, how do you fetch
all the data for a given data_type for a given time period without
potentially scanning many unnecessary rows? The timestamps will sort
alphabetically, not numerically, and the data_types would be mixed
together.

Under the current scheme, since partitioning is done in HDFS, you
could just read <data_type>/<time>/part-* to get exactly the records
you're looking for.
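
For concreteness, here is a rough sketch of the kind of selective scan
I have in mind, written against the classic HTable client API. The
table name, data type, and key layout (<data_type>/<zero-padded
timestamp>) are all made up for illustration; the point is that the
timestamp has to be zero-padded (or binary encoded) so that byte order
agrees with numeric time order:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DataTypeWindowScan {
      // Hypothetical key layout: <data_type>/<zero-padded epoch millis>.
      // Zero-padding makes lexicographic byte order agree with numeric
      // time order, so one contiguous key range covers a single data
      // type for a single time window.
      static byte[] rowKey(String dataType, long ts) {
        return Bytes.toBytes(String.format("%s/%013d", dataType, ts));
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "chukwa_records"); // made-up table
        long start = 1290488400000L;             // start of 5-minute window
        long stop  = start + 5 * 60 * 1000L;     // exclusive end of window

        Scan scan = new Scan(rowKey("SysMetrics", start),  // inclusive start
                             rowKey("SysMetrics", stop));  // exclusive stop
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          // only rows for this data type and this window come back
        }
        scanner.close();
        table.close();
      }
    }

With the key the other way around (<timestamp>/<data_type>), there is
no single start/stop pair that isolates one data type, which is the
part I'm unclear about.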


On Mon, Nov 22, 2010 at 5:00 PM, Eric Yang <eyang@yahoo-inc.com> wrote:
> Comparison chart:
>
> ---------------------------------------------------------------------------
> | Chukwa Types         | Chukwa classic         | Chukwa on Hbase         |
> ---------------------------------------------------------------------------
> | Installation cost    | Hadoop + Chukwa        | Hadoop + Hbase + Chukwa |
> ---------------------------------------------------------------------------
> | Data latency         | fixed n Minutes        | 50-100 ms               |
> ---------------------------------------------------------------------------
> | File Management      | Hourly/Daily Roll Up   | Hbase periodically      |
> | Cost                 | Mapreduce Job          | spill data to disk      |
> ---------------------------------------------------------------------------
> | Record Size          | Small needs to fit     | Data node block         |
> |                      | in java HashMap        | size (64MB)             |
> ---------------------------------------------------------------------------
> | GUI friendly view    | Data needs to be       | drill down to raw       |
> |                      | aggregated first       | data or aggregated      |
> ---------------------------------------------------------------------------
> | Demux                | Single reducer         | Write to hbase in       |
> |                      | or creates multiple    | parallel                |
> |                      | part-nnn files, and    |                         |
> |                      | unsorted between files |                         |
> ---------------------------------------------------------------------------
> | Demux Output         | Sequence file          | Hbase Table             |
> ---------------------------------------------------------------------------
> | Data analytics tools | Mapreduce/Pig          | MR/Pig/Hive/Cascading   |
> ---------------------------------------------------------------------------
>
> Regards,
> Eric
>
> On 11/22/10 3:05 PM, "Ahmed Fathalla" <afathalla@gmail.com> wrote:
>
>> I think what we need to do is create some kind of comparison table
>> contrasting the merits of each approach (HBase vs. normal Demux
>> processing). This exercise will be useful both for deciding on the
>> default and for documentation purposes, to illustrate the difference
>> for new users.
>>
>>
>> On Mon, Nov 22, 2010 at 11:19 PM, Bill Graham <billgraham@gmail.com> wrote:
>>
>>> We are going to continue to have use cases where we want log data
>>> rolled up into 5-minute, hourly, and daily increments in HDFS so we
>>> can run mapreduce jobs on them. How will this model work with the
>>> HBase approach? What process will aggregate the HBase data into time
>>> increments like the current demux and hourly/daily rolling processes
>>> do? Basically, what does the time partitioning look like in the HBase
>>> storage scheme?
>>>
>>>> My concern is that the demux process is going to become two parallel
>>>> tracks: one that works in mapreduce, and another that works in the
>>>> collector.  It becomes difficult to have clean, efficient parsers
>>>> that work in both
>>>
>>> This statement makes me concerned that you're implying the need to
>>> deprecate the current demux model, which is very different than making
>>> one or the other the default in the configs. Is that the case?
>>>
>>>
>>>
>>> On Mon, Nov 22, 2010 at 11:41 AM, Eric Yang <eyang@yahoo-inc.com> wrote:
>>>> MySQL support has been removed from Chukwa 0.5.  My concern is that the
>>>> demux process is going to become two parallel tracks: one that works in
>>>> mapreduce, and another that works in the collector.  It becomes difficult
>>>> to have clean, efficient parsers that work in both places.  From an
>>>> architecture perspective, incremental updates to data are better than
>>>> batch processing for near-real-time monitoring purposes.  I'd like to
>>>> ensure the Chukwa framework can deliver on Chukwa's mission statement,
>>>> hence I stand by HBase as the default.  I was playing with the HBase
>>>> 0.20.6 + Pig 0.8 branch last weekend, and I was very impressed by the
>>>> speed and performance of this combination.  I encourage people to try
>>>> it out.
>>>>
>>>> Regards,
>>>> Eric
>>>>
>>>> On 11/22/10 10:50 AM, "Ariel Rabkin" <asrabkin@gmail.com> wrote:
>>>>
>>>> I agree with Bill and Deshpande that we ought to make clear to users
>>>> that they don't need HICC, and therefore don't need either MySQL or
>>>> HBase.
>>>>
>>>> But I think what Eric meant to ask was which of MySQL and HBase ought
>>>> to be the default *for HICC*.  My sense is that the HBase support
>>>> isn't quite mature enough, but it's getting there.
>>>>
>>>> I think HBase is ultimately the way to go. I think we might benefit as
>>>> a community by doing a 0.5 release first, while waiting for the
>>>> pig-based aggregation support that's blocking HBase.
>>>>
>>>> --Ari
>>>>
>>>> On Mon, Nov 22, 2010 at 10:47 AM, Deshpande, Deepak
>>>> <ddeshpande@verisign.com> wrote:
>>>>> I agree. Making HBase the default would make some Chukwa users' lives
>>>>> difficult. In my setup, I don't need HDFS. I am using Chukwa merely as a
>>>>> log streaming framework. I have plugged in my own writer to write log
>>>>> files to the local file system (instead of HDFS). I evaluated Chukwa
>>>>> against other frameworks, and Chukwa had better fault tolerance built in
>>>>> than the others. This made me recommend Chukwa over other frameworks.
>>>>>
>>>>> Making HBase the default option would definitely make my life
>>>>> difficult :).
>>>>>
>>>>> Thanks,
>>>>> Deepak Deshpande
>>>>>
>>>>
>>>>
>>>> --
>>>> Ari Rabkin asrabkin@gmail.com
>>>> UC Berkeley Computer Science Department
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Ahmed Fathalla
>>
>
>
