hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase 6x bigger than raw data
Date Mon, 27 Jan 2014 23:02:21 GMT
Enabling compression (http://hbase.apache.org/book.html#compression) is
separate from data block encoding (HBASE-4218).
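
For example, the two are configured independently on a column family. A minimal
sketch, assuming the 0.96-era Java client API (package and class names vary
across versions; the table and family names here are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    public class AlterFamilyExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HColumnDescriptor readings = new HColumnDescriptor("readings");
        // On-disk HFile compression (GZ, LZO, SNAPPY, ...)
        readings.setCompressionType(Compression.Algorithm.SNAPPY);
        // Delta/prefix encoding of KeyValues (HBASE-4218), applied per block
        readings.setDataBlockEncoding(DataBlockEncoding.PREFIX);

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("sensor_data"));
        table.addFamily(readings);
        admin.createTable(table);
        admin.close();
      }
    }

The same attributes can also be set from the HBase shell when creating or
altering the table.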

Cheers


On Mon, Jan 27, 2014 at 2:59 PM, Tom Brown <tombrown52@gmail.com> wrote:

> Does enabling compression include prefix compression (HBASE-4218), or is
> there a separate switch for that?
>
> --Tom
>
>
> On Mon, Jan 27, 2014 at 3:48 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > To make better use of block cache, see:
> >
> > HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix
> > compression)
> >
> > which is in 0.94 and above
> >
> > To reduce the size of HFiles, please see:
> > http://hbase.apache.org/book.html#compression
> >
> >
> > On Mon, Jan 27, 2014 at 2:40 PM, Nick Xie <nick.xie.hadoop@gmail.com>
> > wrote:
> >
> > > Tom,
> > >
> > > Yes, you are right. According to this analysis (
> > > http://prafull-blog.blogspot.in/2012/06/how-to-calculate-record-size-of-hbase.html
> > > ), if it is correct, then the overhead is quite big when the cell value
> > > occupies only a small portion of each cell.
> > >
> > > In the analysis at that link, the overhead is actually 10x: the real
> > > values take only 12B, yet it costs 123B in HBase to store them. Is that
> > > real?
> > >
> > > In this case, should we combine some of the values to reduce the overhead?
> > >
> > > Thanks,
> > >
> > > Nick
> > >
> > >
> > >
> > >
> > > On Mon, Jan 27, 2014 at 2:33 PM, Tom Brown <tombrown52@gmail.com> wrote:
> > >
> > > > I believe each cell stores its own copy of the entire row key, column
> > > > qualifier, and timestamp. Could that account for the increase in size?
> > > >
> > > > --Tom
> > > >
> > > >
> > > > On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <nick.xie.hadoop@gmail.com> wrote:
> > > >
> > > > > I'm importing a set of data into HBase. The CSV file contains 82
> > > > > entries per line: an 8-byte ID, followed by a 16-byte date, and then
> > > > > 80 values of 4 bytes each.
> > > > >
> > > > > The current HBase schema is: ID as the row key, the date in a 'date'
> > > > > family under a 'value' qualifier, and the rest in another family
> > > > > called 'readings' with 'P0', 'P1', 'P2', ... through 'P79' as
> > > > > qualifiers.
> > > > >
> > > > > I'm testing this on a single-node cluster with HBase running in
> > > > > pseudo-distributed mode (no replication, no compression for HBase).
> > > > > After importing a CSV file that is 150MB in HDFS (no replication), I
> > > > > checked the table size, and it shows ~900MB, which is 6x larger than
> > > > > it is in HDFS.
> > > > >
> > > > > Why is there such a large overhead? Am I doing anything wrong here?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Nick
> > > > >
> > > >
> > >
> >
>
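
As a rough illustration of the per-cell overhead discussed above: each KeyValue
stores the row key, family, qualifier, timestamp, and type alongside the value.
A back-of-the-envelope sketch assuming the pre-0.96 KeyValue layout without
cell tags (the class name and exact byte counts are illustrative):

    // Per-cell size, assuming the pre-0.96 KeyValue layout (no cell tags):
    // keyLen(4) + valueLen(4) + rowLen(2) + rowKey + famLen(1) + family
    // + qualifier + timestamp(8) + keyType(1) + value.
    public class KeyValueSizeEstimate {

        // Fixed framing bytes stored with every cell.
        private static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1; // 20 bytes

        static long cellBytes(int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
            return FIXED_OVERHEAD + rowKeyLen + familyLen + qualifierLen + valueLen;
        }

        public static void main(String[] args) {
            // Schema from the thread: 8-byte row key, family "readings" (8 bytes),
            // qualifiers "P0".."P79" (~2 bytes), 4-byte values, plus one
            // "date:value" cell holding a 16-byte date.
            long readingCell = cellBytes(8, 8, 2, 4);           // ~42 bytes per reading
            long dateCell    = cellBytes(8, 4, 5, 16);          // ~53 bytes
            long hbaseRow    = dateCell + 80 * readingCell;     // ~3413 bytes per row
            long rawRow      = 8 + 16 + 80 * 4;                 // 344 bytes of raw data

            System.out.println("per reading cell:        " + readingCell + " B");
            System.out.println("per HBase row (approx):  " + hbaseRow + " B");
            System.out.println("raw data per row:        " + rawRow + " B");
            System.out.println("overhead factor:         " + (double) hbaseRow / rawRow);
        }
    }

With an 8-byte row key, the family name 'readings', a short qualifier, and a
4-byte value, each reading cell costs roughly 42 bytes, which is in line with
the ~10x figure from the cited analysis; data block encoding and compression
recover much of this on disk.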
