hbase-user mailing list archives

From Vladimir Rodionov <vrodio...@carrieriq.com>
Subject RE: HBase 6x bigger than raw data
Date Mon, 27 Jan 2014 22:43:30 GMT
The overhead of storing small values is quite high in HBase unless you use DATA_BLOCK_ENCODING
(not available in 0.92). I recommend moving to the latest 0.94 release.

See: https://issues.apache.org/jira/browse/HBASE-4218
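
For example, something along these lines (an untested sketch against the 0.94
client API; the table and family names are just placeholders for your schema)
enables FAST_DIFF encoding on the column families at table creation time:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    public class CreateEncodedTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // FAST_DIFF encodes away the repeated row key / family / qualifier
            // prefixes inside each data block on disk.
            HColumnDescriptor readings = new HColumnDescriptor("readings");
            readings.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);

            HColumnDescriptor date = new HColumnDescriptor("date");
            date.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);

            HTableDescriptor table = new HTableDescriptor("sensor_data"); // placeholder name
            table.addFamily(readings);
            table.addFamily(date);
            admin.createTable(table);
        }
    }

An existing table can be changed the same way via disable/alter/enable, but the
existing HFiles only pick up the encoding after a major compaction.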

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Nick Xie [nick.xie.hadoop@gmail.com]
Sent: Monday, January 27, 2014 2:40 PM
To: user@hbase.apache.org
Subject: Re: HBase 6x bigger than raw data

Tom,

Yes, you are right. According to this analysis (
http://prafull-blog.blogspot.in/2012/06/how-to-calculate-record-size-of-hbase.html),
assuming it is correct, the overhead is quite large when the cell value makes up
only a small portion of the stored record.

In the analysis at that link, the overhead is actually about 10x: the real
values take only 12B, yet it costs 123B in HBase to store them. Is that real?
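
A quick back-of-the-envelope against our own schema points the same way.
Assuming the standard KeyValue layout (keyLen 4B + valueLen 4B + rowLen 2B +
row + familyLen 1B + family + qualifier + timestamp 8B + keyType 1B + value)
and our names, a single 'readings' cell works out roughly like this:

    public class CellSizeEstimate {
        public static void main(String[] args) {
            // Assumed sizes: 8-byte row key (the ID), family "readings" (8B),
            // qualifier "P0" (2B), one 4-byte reading as the value.
            int row = 8, family = 8, qualifier = 2, value = 4;
            int keyLen = 2 + row + 1 + family + qualifier + 8 + 1; // 30 bytes
            int cellLen = 4 + 4 + keyLen + value;                  // 42 bytes
            System.out.println("cell = " + cellLen + "B for a " + value
                    + "B value (~" + (cellLen / value) + "x)");
        }
    }

So each 4-byte reading carries roughly ten times its own size in key and framing
overhead, before any compression or block encoding.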

In this case, should we combine multiple values into a single cell to reduce the overhead?
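
What I have in mind is packing all 80 readings of a row into one cell, so the
row key, family, qualifier and timestamp are stored once per row instead of
once per reading. A rough sketch (0.94 client API; the table name and the 'all'
qualifier are just placeholders):

    import java.nio.ByteBuffer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PackedReadingsPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "sensor_data"); // placeholder name

            byte[] rowKey = Bytes.toBytes(12345678L);  // the 8-byte ID
            float[] readings = new float[80];          // the 80 values of one CSV line

            // Serialize all 80 readings into a single value.
            ByteBuffer packed = ByteBuffer.allocate(readings.length * 4);
            for (float r : readings) packed.putFloat(r);

            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("readings"), Bytes.toBytes("all"), packed.array());
            table.put(put);
            table.close();
        }
    }

The obvious trade-off is that a single reading can no longer be read or updated
without fetching and rewriting the whole blob.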

Thanks,

Nick

On Mon, Jan 27, 2014 at 2:33 PM, Tom Brown <tombrown52@gmail.com> wrote:

> I believe each cell stores its own copy of the entire row key, column
> qualifier, and timestamp. Could that account for the increase in size?
>
> --Tom
>
>
> On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <nick.xie.hadoop@gmail.com>
> wrote:
>
> > I'm importing a set of data into HBase. The CSV file contains 82 entries
> > per line: an 8-byte ID, followed by a 16-byte date, and then 80 numbers of
> > 4 bytes each.
> >
> > The current HBase schema is: the ID as the row key, the date in a 'date'
> > family under a 'value' qualifier, and the rest in another family called
> > 'readings' with 'P0', 'P1', 'P2', ... through 'P79' as qualifiers.
> >
> > I'm testing this on a single-node cluster with HBase running in
> > pseudo-distributed mode (no replication, no compression for HBase). After
> > importing a CSV file that is 150MB in HDFS (no replication), I checked the
> > table size, and it shows ~900MB, which is 6x larger than the data is
> > in HDFS.
> >
> > Why is there such a large overhead? Am I doing anything wrong here?
> >
> > Thanks,
> >
> > Nick
> >
>

