chukwa-dev mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: [DISCUSSION] Making HBaseWriter default
Date Wed, 24 Nov 2010 18:04:43 GMT
> Rowkey is a combination of timestamp+primary key as a string, i.e. 1234567890-hostname.
> Therefore, the byte order of string sorting works fine.

I don't think this is correct. If your row keys are strings, you'd get
an ordering like this:

1000-hostname
200-hostname
3000-hostname
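
A quick local check of that behavior (plain Java, using the same made-up keys; String comparison is lexicographic, which matches how HBase compares raw string row keys byte by byte):

import java.util.Arrays;

public class KeyOrderCheck {
    public static void main(String[] args) {
        String[] keys = {"200-hostname", "1000-hostname", "3000-hostname"};
        // Lexicographic sort compares character by character, so "1000-..."
        // sorts before "200-..." even though 1000 > 200 numerically.
        Arrays.sort(keys);
        System.out.println(Arrays.toString(keys));
        // -> [1000-hostname, 200-hostname, 3000-hostname]
    }
}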

For the use case I was concerned about, I think it would be solved by
making the row key a long timestamp and the data-type a column family.
Then you could do something similar to what you described:

Scan "user_table", { COLUMNS => "<data_type>", STARTROW => 1234567890,
STOPROW => 1234597890 };

I'm not sure how to do the same thing though if you want to partition
by both hostname and datatype.
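
For illustration, a rough sketch of that scan against the HBase Java client could look like the following. The table and family names are hypothetical, and it assumes the row keys were written as Bytes.toBytes(long) so that byte order and numeric order agree:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
    public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "user_table");

        // Row key = timestamp stored as an 8-byte long, so the byte order
        // of the keys is also their numeric order.
        Scan scan = new Scan(Bytes.toBytes(1234567890L),   // STARTROW (inclusive)
                             Bytes.toBytes(1234597890L));  // STOPROW (exclusive)
        scan.addFamily(Bytes.toBytes("my_data_type"));     // one family per data type

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                // each Result is one timestamp's worth of records for this data type
            }
        } finally {
            scanner.close();
        }
    }
}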


On Tue, Nov 23, 2010 at 1:54 PM, Eric Yang <eyang@yahoo-inc.com> wrote:
> It is more efficient because there is no need to wait for the file to be
> closed before the MapReduce job can be launched.  Each data type is grouped
> into an HBase table or column family.  The choice is in the hands of the
> parser developer.  Rowkey is a combination of timestamp+primary key as a
> string, i.e. 1234567890-hostname.  Therefore, the byte order of string
> sorting works fine.
>
> There are two ways to deal with this problem: the scan can use the StartRow
> feature in HBase to narrow down the row range, or use the HBase timestamp
> field to control the scanning range.  The HBase timestamp is a special
> numeric field.
>
> To translate your query to hbase:
>
> Scan "<data_type>", { STARTROW => 'timestamp' };
>
> Or
>
> Scan "user_table", { COLUMNS => "<data_type>", timestamp => 1234567890 };
>
> The design is up to the parser designer.  FYI, the HBase shell doesn't
> support timestamp range queries, but the Java API does.
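
For reference, a minimal sketch of that timestamp-range query through the HBase Java client (the table and family names just mirror the hypothetical shell example above, and the 5-minute window is only an illustration):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampRangeScan {
    public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "user_table");

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("my_data_type"));
        // Restrict by the internal HBase cell timestamp (milliseconds);
        // the upper bound of the range is exclusive.
        scan.setTimeRange(1234567890000L, 1234567890000L + 5L * 60L * 1000L);

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                // rows whose cells were written inside the 5-minute window
            }
        } finally {
            scanner.close();
        }
    }
}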
>
> Regards,
> Eric
>
> On 11/22/10 10:38 PM, "Bill Graham" <billgraham@gmail.com> wrote:
>
> I see plenty of value in the HBase approach, but I'm still not clear
> on how the time and data type partitioning would be done more
> efficiently within HBase when running a job on a specific 5 minute
> interval for a given data type. I've only used HBase briefly so I
> could certainly be missing something, but I thought the sort for range
> scans is by byte order, which works for string types, but not numbers.
>
> So if your row ids are <timestamp>/<data_type>, how do you fetch
> all the data for a given data_type for a given time period without
> potentially scanning many unnecessary rows? The timestamps will be in
> alphabetical order, not numeric, and data_types would be mixed.
>
> Under the current scheme, since partitioning is done in HDFS you could
> just get <data_type>/<time>/part-* to get exactly the records you're
> looking for.
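
For comparison, fetching those part files from HDFS is a plain glob over the directory layout, e.g. via the Hadoop FileSystem API (the repository path below is purely illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDemuxOutput {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // Illustrative layout: <data_type>/<time>/part-*
        Path glob = new Path("/chukwa/repos/SystemMetrics/2010-11-22_05-00/part-*");
        FileStatus[] parts = fs.globStatus(glob);
        if (parts != null) {
            for (FileStatus part : parts) {
                System.out.println(part.getPath());  // exactly the files for that type/interval
            }
        }
    }
}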
>
>
> On Mon, Nov 22, 2010 at 5:00 PM, Eric Yang <eyang@yahoo-inc.com> wrote:
>> Comparison chart:
>>
>> ---------------------------------------------------------------------------
>> | Chukwa Types         | Chukwa classic         | Chukwa on HBase         |
>> ---------------------------------------------------------------------------
>> | Installation cost    | Hadoop + Chukwa        | Hadoop + HBase + Chukwa |
>> ---------------------------------------------------------------------------
>> | Data latency         | fixed n minutes        | 50-100 ms               |
>> ---------------------------------------------------------------------------
>> | File management      | Hourly/daily roll-up   | HBase periodically      |
>> | cost                 | MapReduce job          | spills data to disk     |
>> ---------------------------------------------------------------------------
>> | Record size          | Small, needs to fit    | Data node block         |
>> |                      | in a Java HashMap      | size (64 MB)            |
>> ---------------------------------------------------------------------------
>> | GUI-friendly view    | Data needs to be       | Drill down to raw       |
>> |                      | aggregated first       | data or aggregated      |
>> ---------------------------------------------------------------------------
>> | Demux                | Single reducer, or     | Writes to HBase in      |
>> |                      | creates multiple       | parallel                |
>> |                      | part-nnn files,        |                         |
>> |                      | unsorted between files |                         |
>> ---------------------------------------------------------------------------
>> | Demux output         | Sequence file          | HBase table             |
>> ---------------------------------------------------------------------------
>> | Data analytics tools | MapReduce/Pig          | MR/Pig/Hive/Cascading   |
>> ---------------------------------------------------------------------------
>>
>> Regards,
>> Eric
>>
>> On 11/22/10 3:05 PM, "Ahmed Fathalla" <afathalla@gmail.com> wrote:
>>
>>> I think what we need to do is create some kind of comparison table
>>> contrasting the merits of each approach (HBase vs. normal Demux
>>> processing). This exercise will be useful both for making the decision
>>> about which should be the default and for documentation purposes, to
>>> illustrate the differences for new users.
>>>
>>>
>>> On Mon, Nov 22, 2010 at 11:19 PM, Bill Graham <billgraham@gmail.com>
>>> wrote:
>>>
>>>> We are going to continue to have use cases where we want log data
>>>> rolled up into 5 minute, hourly and daily increments in HDFS to run
>>>> map reduce jobs on them. How will this model work with the HBase
>>>> approach? What process will aggregate the HBase data into time
>>>> increments like the current demux and hourly/daily rolling processes
>>>> do? Basically, what does the time partitioning look like in the HBase
>>>> storage scheme?
>>>>
>>>>> My concern is that the demux process is going to become two parallel
>>>>> tracks, one that works in MapReduce and another that works in the
>>>>> collector.  It becomes difficult to have clean, efficient parsers that work in both
>>>>
>>>> This statement makes me concerned that you're implying the need to
>>>> deprecate the current demux model, which is very different than making
>>>> one or the other the default in the configs. Is that the case?
>>>>
>>>>
>>>>
>>>> On Mon, Nov 22, 2010 at 11:41 AM, Eric Yang <eyang@yahoo-inc.com> wrote:
>>>>> MySQL support has been removed from Chukwa 0.5.  My concern is that the
>>>>> demux process is going to become two parallel tracks, one that works in
>>>>> MapReduce and another that works in the collector.  It becomes difficult to
>>>>> have clean, efficient parsers that work in both places.  From an
>>>>> architecture perspective, incremental updates to data are better than batch
>>>>> processing for near-real-time monitoring purposes.  I'd like to ensure the
>>>>> Chukwa framework can deliver on Chukwa's mission statement, hence I stand
>>>>> by HBase as the default.  I was playing with the HBase 0.20.6 + Pig 0.8
>>>>> branch last weekend, and I was very impressed by both the speed and
>>>>> performance of this combination.  I encourage people to try it out.
>>>>>
>>>>> Regards,
>>>>> Eric
>>>>>
>>>>> On 11/22/10 10:50 AM, "Ariel Rabkin" <asrabkin@gmail.com> wrote:
>>>>>
>>>>> I agree with Bill and Deshpande that we ought to make clear to users
>>>>> that they don't need HICC, and therefore don't need either MySQL or
>>>>> HBase.
>>>>>
>>>>> But I think what Eric meant to ask was which of MySQL and HBase ought
>>>>> to be the default *for HICC*.  My sense is that the HBase support
>>>>> isn't quite mature enough, but it's getting there.
>>>>>
>>>>> I think HBase is ultimately the way to go. I think we might benefit as
>>>>> a community by doing a 0.5 release first, while waiting for the
>>>>> pig-based aggregation support that's blocking HBase.
>>>>>
>>>>> --Ari
>>>>>
>>>>> On Mon, Nov 22, 2010 at 10:47 AM, Deshpande, Deepak
>>>>> <ddeshpande@verisign.com> wrote:
>>>>>> I agree. Making HBase the default would make some Chukwa users' lives
>>>>>> difficult. In my setup, I don't need HDFS. I am using Chukwa merely as a
>>>>>> log streaming framework. I have plugged in my own writer to write log
>>>>>> files to the local file system (instead of HDFS). I evaluated Chukwa
>>>>>> against other frameworks, and Chukwa had better fault tolerance built in
>>>>>> than the other frameworks. This made me recommend Chukwa over the others.
>>>>>>
>>>>>> Making HBase the default option would definitely make my life
>>>>>> difficult :).
>>>>>>
>>>>>> Thanks,
>>>>>> Deepak Deshpande
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ari Rabkin asrabkin@gmail.com
>>>>> UC Berkeley Computer Science Department
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Ahmed Fathalla
>>>
>>
>>
>
>
