hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: KeyValue size in bytes compared to store files size
Date Wed, 15 Jan 2014 19:52:15 GMT
There can be a lot of duplication in what ends up in HFiles but 500MB ->
32MB does seem too good to be true.

Could you try writing without GZIP or mess with the hfile reader[1] to see
what your keys look like when at rest in an HFile (and maybe save the
decompressed hfile to compare sizes?)

St.Ack
1. http://hbase.apache.org/book.html#hfile
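
To get a rough feel for how far GZIP alone can shrink highly repetitive cell data, a standalone sketch like the one below may help. It does not touch HBase or HFiles at all: the class name, the helper names and the synthetic "row|family|qualifier|value" strings are all made up for illustration, and the ratio it prints says nothing definitive about a real table. It only shows that repeated row/family/qualifier bytes compress very well.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: not HBase code, just java.util.zip on synthetic data.
public class GzipRatioSketch {

    /** Returns the size in bytes of the buffer after GZIP compression. */
    public static int gzipSize(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The GZIP stream is closed here, so the trailer has been written.
        return out.size();
    }

    /**
     * Builds a buffer of made-up cells. Every cell repeats its row,
     * family and qualifier, loosely imitating how each KeyValue
     * carries its full coordinates.
     */
    public static byte[] repetitiveCells(int rows, int columnsPerRow) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnsPerRow; c++) {
                String cell = "row-" + r + "|cf|qualifier-" + c + "|" + (c % 7) + "\n";
                byte[] bytes = cell.getBytes(StandardCharsets.UTF_8);
                buf.write(bytes, 0, bytes.length);
            }
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = repetitiveCells(10_000, 20);
        int compressed = gzipSize(raw);
        System.out.printf("raw=%d bytes, gzip=%d bytes, ratio=%.1f:1%n",
                raw.length, compressed, (double) raw.length / compressed);
    }
}
```

Swapping the synthetic strings for a sample of your real keys and values would give a more meaningful number to compare against the store file size.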


On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <amits@infolinks.com> wrote:

> I'm talking about the store files size and the ratio between store file
> size and the byte count as counted in PutSortReducer.
>
>
> On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> >
> >
> > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <amits@infolinks.com> wrote:
> >
> > > Hi all,
> > > I'm trying to measure the size (in bytes) of the data I'm about to load
> > > into HBase.
> > > I'm using bulk load with PutSortReducer.
> > > All bulk load data is loaded into new regions and not added to existing
> > > ones.
> > >
> > > In order to count the size of all KeyValues in the Put object I iterate
> > > over the Put's familyMap.values() and sum the KeyValue lengths.
> > > After loading the data, I check the region size by summing the
> > > RegionLoad.getStorefileSizeMB().
> > > Counting the Put objects' sizes predicted ~500MB per region, but in
> > > practice I got ~32MB per region.
> > > The table uses GZ compression, but this cannot be the cause of such a
> > > difference.
> > >
> > > Is counting the Put's KeyValues the correct way to count a row size? Is
> > > it comparable to the store files size?
> > >
> > > Thanks,
> > > Amit.
> > >
> >
>
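
For reference, the per-KeyValue byte count discussed above can be sketched without any HBase dependency. This assumes the classic KeyValue layout without tags (4-byte key length, 4-byte value length, 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type, value); the class and method names are hypothetical, and newer HBase versions that store tags add further bytes not counted here.

```java
// Hypothetical sketch: arithmetic only, assuming the untagged KeyValue layout.
public class KeyValueSizeSketch {

    /** Bytes in the key part: row length, row, family length, family, qualifier, timestamp, type. */
    public static long keyLength(int rowLen, int familyLen, int qualifierLen) {
        return 2L + rowLen + 1 + familyLen + qualifierLen + 8 + 1;
    }

    /** Total serialized bytes: two 4-byte length fields, then the key, then the value. */
    public static long serializedLength(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return 4L + 4 + keyLength(rowLen, familyLen, qualifierLen) + valueLen;
    }

    public static void main(String[] args) {
        // Made-up example: 16-byte row, 2-byte family, 10-byte qualifier, 4-byte value.
        long total = serializedLength(16, 2, 10, 4);
        System.out.println("bytes per KeyValue: " + total);   // prints 52
        System.out.println("value share: " + (4.0 / total));
    }
}
```

In this made-up example only 4 of the 52 bytes are the value; the rest is coordinates and length fields that repeat from cell to cell, which is the kind of duplication a compressor can remove.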
