hive-user mailing list archives

From Bruce Bian <weidong....@gmail.com>
Subject Re: HFileInputFormat for MapReduce
Date Fri, 10 Feb 2012 02:39:52 GMT
I also encountered this issue when comparing Hive+HBase with
Hive+HDFS (native Hive tables). After some tuning (ensuring data locality,
using scan caching, setting an appropriate number of mappers per node, etc.),
Hive+HBase is still around 4-5X slower.
I guess the two main reasons are:
1) HFile repeats the key for each K/V pair, so it is more redundant than the
sequence files used by native Hive tables. (In my case, the same table is
~5X larger in HBase than in Hive flat files.)

2) The HBase API adds an extra layer of RPC. Tatsuya ran a test that reads
from HDFS directly and claims it is ~2.5X faster
(https://github.com/tatsuya6502/hbase-mr-pof). So implementing
HFileInputFormat looks promising if the pitfalls mentioned are tolerable.
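
To make reason 1) concrete, here is a rough back-of-the-envelope sketch in
Python. It assumes the uncompressed KeyValue layout (each cell stores its own
key length, value length, row key, family, qualifier, timestamp and type) and
compares it with a simple delimited flat row. The row key, family name and
column values are made up for illustration, and compression, block indexes
and file headers are ignored, so the ratio only shows the direction of the
effect, not my actual ~5X measurement.

```python
# Fixed per-cell bytes in the KeyValue layout:
# key length (4) + value length (4) + row length (2) + family length (1)
# + timestamp (8) + key type (1)
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def hbase_cell_bytes(row_key, family, qualifier, value):
    """Bytes for one cell: the full key is stored again for every cell."""
    return (KV_FIXED_OVERHEAD + len(row_key) + len(family)
            + len(qualifier) + len(value))


def hbase_row_bytes(row_key, family, columns):
    """Bytes for one logical row stored as one cell per column."""
    return sum(hbase_cell_bytes(row_key, family, q, v)
               for q, v in columns.items())


def flat_row_bytes(row_key, columns, delimiter_len=1):
    """Bytes for the same row as a delimited line: key stored once."""
    fields = [row_key] + list(columns.values())
    return sum(len(f) + delimiter_len for f in fields)


# Hypothetical sample row (not from my real table).
row_key = b"user#0000012345"
family = b"cf"
columns = {b"age": b"34", b"city": b"NYC", b"score": b"87", b"flag": b"1"}

hbase = hbase_row_bytes(row_key, family, columns)
flat = flat_row_bytes(row_key, columns)
ratio = hbase / flat
print(hbase, flat, round(ratio, 2))
```

The shorter the values are relative to the row key and column names, the
worse the ratio gets, which is why narrow tables with many small columns
suffer most.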

For now, though, we're taking the approach of periodically exporting HBase
data to HDFS, as we need good performance for both random reads/writes and
analysis jobs.
