hbase-dev mailing list archives

From Tatsuya Kawano <tatsuya6...@gmail.com>
Subject Re: Items to contribute (plan)
Date Wed, 26 Jan 2011 11:32:10 GMT

Hi Ryan, 

>> 2. mapreduce.HFileInputFormat
>> MR library to read data directly from HFiles. (Roughly 2.5 times faster than TableInputFormat in my tests)
>> Current status: Completed a proof-of-concept prototype and measured performance.

> On Jan 23, 2011, Ryan Rawson wrote:
>> #2 is interesting, what is the benefit? How did you measure said benefit?

I have only performed simplified tests: a single test thread on a single server. It was not
even an MR job, just a simple program that scans through all the rows in the table. I'll
definitely need deeper tests in a clustered environment to get more realistic results.

The related test programs can be found here (V1 is the one):

And the chart comparing throughput on RS, HFileInputFormat and HDFS SequenceFile: 

Please note: The disk drive attached to the EC2 instance was slow, so for this particular
test I used a table small enough for the whole contents of the files to fit in Linux's disk
read cache, ran each test twice, and recorded only the second result. (I restarted the RS
between the first and second runs to clear its block cache.)
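
The "run twice, keep only the second result" measurement described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual test program: `measure_throughput` and `scan` are hypothetical names, `scan` stands for any function that returns an iterator over the rows (from the RS, an HFile, or a SequenceFile), and the RS restart between runs is not modeled.

```python
import time


def measure_throughput(scan, runs=2):
    """Call `scan` `runs` times; report row count and rows/second of the last run.

    `scan` is a hypothetical zero-argument callable returning an iterable of rows.
    Earlier runs only serve to warm the OS disk read cache and are discarded.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    count, rows_per_sec = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        count = sum(1 for _ in scan())  # consume every row, keep nothing
        elapsed = time.perf_counter() - start
        # Guard against a zero-length interval on very small inputs.
        rows_per_sec = count / elapsed if elapsed > 0 else float("inf")
    return count, rows_per_sec


if __name__ == "__main__":
    # Stand-in data source; a real test would scan a table or file instead.
    rows, rate = measure_throughput(lambda: iter(range(100_000)))
    print(f"{rows} rows, {rate:.0f} rows/sec (second run only)")
```

Because only the last run is reported, the number reflects reads served from cache rather than from the slow disk, which is the point of the setup described above.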

One interesting thing I saw in the results was that HDFS SequenceFile didn't scale well in my
environment. SequenceFile needed more processor power than HFile and was bottlenecked by the
CPU: utilization was about 100% for SequenceFile and about 30% for HFile throughout the tests.

- Tatsuya

Tatsuya Kawano
Tokyo, Japan
