hbase-user mailing list archives

From Nick Xie <nick.xie.had...@gmail.com>
Subject Re: HBase 6x bigger than raw data
Date Mon, 27 Jan 2014 22:40:04 GMT
Tom,

Yes, you are right. According to this analysis
(http://prafull-blog.blogspot.in/2012/06/how-to-calculate-record-size-of-hbase.html),
if it is correct, the overhead is quite large when the cell value makes up
only a small portion of each stored cell.

In the analysis at that link, the overhead is actually about 10x (the real
values take only 12B, yet it costs 123B in HBase to store them). Is that
really the case?
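
To make the per-cell cost concrete, here is a rough sketch that estimates the
size of one row under the schema quoted below. It assumes the classic KeyValue
layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte
family length, 8-byte timestamp, 1-byte type, no tags) and assumes the ID,
date, and readings are stored as raw bytes of the stated widths. The function
names are made up for illustration, and HFile block and index overhead is
ignored, so the numbers will not match a real table exactly.

```python
# Fixed framing bytes per cell in the assumed KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row_len, family, qualifier, value_len):
    """Estimated bytes for one cell; every cell repeats row, family, qualifier."""
    return FIXED_OVERHEAD + row_len + len(family) + len(qualifier) + value_len


def row_size():
    """Estimated bytes for one logical row: 1 date cell + 80 reading cells."""
    date_cell = keyvalue_size(8, b"date", b"value", 16)
    readings = sum(
        keyvalue_size(8, b"readings", f"P{i}".encode(), 4) for i in range(80)
    )
    return date_cell + readings


def raw_size():
    """Bytes of actual data per row: 8-byte ID, 16-byte date, 80 x 4-byte values."""
    return 8 + 16 + 80 * 4


if __name__ == "__main__":
    print(row_size(), raw_size(), row_size() / raw_size())
```

Under these assumptions a 4-byte reading costs 42 or 43 bytes as a cell, and a
whole row comes to 3483 bytes against 344 bytes of data, roughly 10x. The
ratio against a text CSV file will be lower, since the CSV itself is larger
than the packed binary data.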

In this case, should we combine several values into one cell to reduce the
overhead?
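
As a rough sketch of what combining could save, the estimate below packs all
80 readings into a single 320-byte value under one qualifier. The qualifier
name 'all' is hypothetical, and the same assumptions apply as for any
back-of-the-envelope KeyValue estimate: classic layout with 20 bytes of fixed
framing per cell, an 8-byte row key, raw binary values, and no HFile block or
index overhead.

```python
# Fixed framing bytes per cell in the assumed KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 20
ROW_KEY_LEN = 8  # 8-byte ID used as the row key


def cell_size(family, qualifier, value_len):
    """Estimated bytes for one cell, including the repeated row key."""
    return FIXED_OVERHEAD + ROW_KEY_LEN + len(family) + len(qualifier) + value_len


def combined_row_size():
    """One date cell plus one cell holding all 80 readings (80 * 4 bytes)."""
    date_cell = cell_size(b"date", b"value", 16)
    packed_readings = cell_size(b"readings", b"all", 80 * 4)  # 'all' is hypothetical
    return date_cell + packed_readings


if __name__ == "__main__":
    raw = 8 + 16 + 80 * 4  # 344 bytes of actual data per row
    print(combined_row_size(), raw, combined_row_size() / raw)
```

With these assumptions the row shrinks to 412 bytes, about 1.2x the raw data.
The trade-off is that individual readings can no longer be read or updated as
separate columns; the client has to unpack the combined value itself.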

Thanks,

Nick




On Mon, Jan 27, 2014 at 2:33 PM, Tom Brown <tombrown52@gmail.com> wrote:

> I believe each cell stores its own copy of the entire row key, column
> qualifier, and timestamp. Could that account for the increase in size?
>
> --Tom
>
>
> On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <nick.xie.hadoop@gmail.com>
> wrote:
>
> > I'm importing a set of data into HBase. The CSV file contains 82 entries
> > per line: an 8-byte ID, followed by a 16-byte date, and then 80 numbers
> > of 4 bytes each.
> >
> > The current HBase schema is: ID as row key, date as a 'date' family with
> > 'value' qualifier, the rest is in another family called 'readings' with
> > 'P0', 'P1', 'P2', ... through 'P79' as qualifiers.
> >
> > I'm testing this on a single-node cluster with HBase running in
> > pseudo-distributed mode (no replication, no compression for HBase). After
> > importing a 150MB CSV file from HDFS (no replication), I checked the
> > table size, and it shows ~900MB, which is 6x larger than the file is in
> > HDFS.
> >
> > Why is there such a large overhead? Am I doing anything wrong here?
> >
> > Thanks,
> >
> > Nick
> >
>
