hbase-user mailing list archives

From abhishek1015 <abhishek1...@gmail.com>
Subject Re: HFile V2 vs HFile V3
Date Sun, 15 Jun 2014 03:21:14 GMT
Dremel is designed to store nested structures of arbitrary depth. It uses
repetition and definition levels to reconstruct the nested structure.
However, Bigtable-like systems such as HBase and Cassandra are
multi-dimensional sorted maps, which map (rowkey, column family, column key,
timestamp) to a value. Therefore, neither repetition nor definition levels
are required to reconstruct a row. This could be a reason why Cassandra is
using a Dremel-inspired format rather than implementing Dremel itself.
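
To illustrate the point, here is a minimal Python sketch of the sorted-map
model. It is not HBase or Cassandra code: the cell data and the helper name
reconstruct_rows are hypothetical, and the sketch assumes the newest
timestamp sorts first within a column. Each cell carries its full
coordinates, so a row can be rebuilt from the keys alone, with no repetition
or definition levels.

```python
from collections import defaultdict

# Hypothetical cells: (rowkey, column_family, column_key, timestamp) -> value
cells = {
    ("row1", "cf", "name", 2): "alice",
    ("row1", "cf", "name", 1): "al",
    ("row1", "cf", "order_17", 1): "book",
    ("row2", "cf", "name", 1): "bob",
}


def sort_key(item):
    """Order cells by row, family, column key, then newest timestamp first."""
    (row, family, column_key, timestamp), _value = item
    return (row, family, column_key, -timestamp)


def reconstruct_rows(cells):
    """Rebuild each row as {"family:column_key": newest value}."""
    rows = defaultdict(dict)
    for (row, family, column_key, _ts), value in sorted(cells.items(), key=sort_key):
        column = f"{family}:{column_key}"
        # The first cell seen for a column is the newest one; keep only that.
        rows[row].setdefault(column, value)
    return dict(rows)


if __name__ == "__main__":
    for rowkey, columns in reconstruct_rows(cells).items():
        print(rowkey, columns)
```

The flat key is enough because the data model has a fixed depth (row, column,
version), unlike the arbitrarily nested records Dremel targets.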

We can also visualize this sorted map as a table with "rowkey" and
"column-family:columnkey" as columns and "timestamp,value" as cell values.
The HFile format is designed on the assumption that an HBase table is very
sparse. This assumption holds in many cases where the column key itself
stores information (e.g. order_id). However, it does not hold for all
tables. In many cases, the column key is used as a traditional column name.

Therefore, it would be good to have two file formats. Based on the sparsity
of a table, the user could choose between the traditional HFile and a
columnar format. As a lot of companies are using HBase, I am wondering
whether any company would be interested in sharing an anonymized production
trace so that I can estimate the sparsity of their tables and validate my
argument.
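
As a rough sketch of how such an estimate could be computed from a trace,
assume the trace can be reduced to (rowkey, column) pairs, one per stored
cell. The function name estimate_sparsity and the definition used here
(fraction of empty cells in the rows x columns grid) are assumptions made
for illustration, not an established HBase metric.

```python
def estimate_sparsity(cell_coordinates):
    """Return the fraction of empty cells in the rows x columns grid.

    cell_coordinates: iterable of (rowkey, column) pairs, one per stored cell.
    0.0 means every row has every column; values near 1.0 mean very sparse.
    """
    rows, columns, filled = set(), set(), set()
    for rowkey, column in cell_coordinates:
        rows.add(rowkey)
        columns.add(column)
        filled.add((rowkey, column))  # multiple versions count once

    grid_size = len(rows) * len(columns)
    if grid_size == 0:
        return 0.0  # empty trace: nothing to measure
    return 1.0 - len(filled) / grid_size


if __name__ == "__main__":
    # Column key used as a traditional column name: dense table.
    dense = [("r1", "cf:name"), ("r1", "cf:age"),
             ("r2", "cf:name"), ("r2", "cf:age")]
    # Column key used to store data (e.g. an order id): sparse table.
    sparse = [("r1", "cf:order_1"), ("r2", "cf:order_2"), ("r3", "cf:order_3")]
    print(estimate_sparsity(dense), estimate_sparsity(sparse))
```

A value close to 0 would favor a columnar layout, while a value close to 1
would favor the traditional HFile.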

Thanks
Abhishek

--
View this message in context: http://apache-hbase.679495.n3.nabble.com/HFile-V2-vs-HFile-V3-tp4060405p4060450.html
Sent from the HBase User mailing list archive at Nabble.com.
