hbase-user mailing list archives

From Ashish Shinde <ash...@strandls.com>
Subject Re: Bulk upload with multiple reducers with hbase-0.90.0
Date Sat, 22 Jan 2011 06:17:58 GMT
Yes, I did try LZO compression and it helped; however, the resultant disk
usage was on par with the uncompressed text size.

Writing out our serialized data records in batches, with each batch
stored as a single row with a single column and LZO compression enabled,
resulted in the data getting compressed to 25-30% of its original size.
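
(For context, enabling LZO on the column family looked roughly like the
sketch below. This is a 0.90-era sketch with made-up table/family names,
and it assumes the hadoop-lzo native libraries are installed on every
region server.)

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateLzoTable {
      public static void main(String[] args) throws IOException {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("records");   // made-up table name
        HColumnDescriptor family = new HColumnDescriptor("d");     // made-up family name
        // Store files for this family are LZO-compressed as they are written.
        family.setCompressionType(Compression.Algorithm.LZO);
        desc.addFamily(family);
        admin.createTable(desc);
      }
    }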

The impressive thing was that with the above approach the number of
rows in our test data dropped from 12 million to 12 K. However, the
insert times were very similar, again indicating that HBase insert
times are largely independent of the current table size.

On a side note, I needed to run a major compaction to get the data
compressed; bulk upload did not write the data out compressed.
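
(For completeness, the major compaction can be triggered from client
code as well as from the shell's major_compact command; a sketch against
the 0.90 HBaseAdmin API, with "records" again a made-up table name:)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CompactRecords {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        // Rewrites every store file for the table, so data that was bulk
        // loaded uncompressed is rewritten with the family's compression.
        admin.majorCompact("records");   // made-up table name
      }
    }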

Am I missing something?

Thanks and regards,
 - Ashish


 On Thu, 20 Jan 2011 21:23:03 -0800
Ted Dunning <tdunning@maprtech.com> wrote:

> Were you using LZO?  Repetitive keys should compress almost to
> nothing.
> 
> On Thu, Jan 20, 2011 at 8:48 PM, Ashish Shinde <ashish@strandls.com>
> wrote:
> 
> > Hi Stack,
> >
> > Yes, that makes sense. I will approach it from our needs perspective.
> >
> > I tried using a prebaked table and a reasonable partitioner, with very
> > promising results in terms of insert times.
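
(Roughly, the "prebaked table plus partitioner" setup looks like the
sketch below when driven through the 0.90 bulk-load tooling; the table
name, family, and split points are invented, and the mapper/input wiring
is omitted.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class PrebakedBulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Pre-split ("prebake") the table so there is one region, and hence
        // one reducer, per key range.  Split points are invented here.
        byte[][] splits = { Bytes.toBytes("1000"),
                            Bytes.toBytes("2000"),
                            Bytes.toBytes("3000") };
        HTableDescriptor desc = new HTableDescriptor("records");   // made-up name
        desc.addFamily(new HColumnDescriptor("d"));
        new HBaseAdmin(conf).createTable(desc, splits);

        Job job = new Job(conf, "bulk-load-records");
        // Sets TotalOrderPartitioner and one reducer per region, so each
        // reducer writes the HFiles for exactly one region.
        HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "records"));
        // ... configure the mapper and input paths, run the job, then load
        // the HFiles with the completebulkload tool (LoadIncrementalHFiles).
      }
    }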
> >
> > However, importing a 1.6 GB test file resulted in an HBase folder of
> > roughly 6 GB. Although in most cases people are not disk-size
> > sensitive, we would really like to keep disk usage to a
> > minimum.
> >
> > The nature of the data required me to create a rowkey that was 100
> > bytes long. An examination of the table's data blocks revealed that
> > every column in a data block is preceded by the rowkey, and in our
> > case this results in an overhead of roughly 6x. Am I doing something
> > obviously wrong?
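
(A back-of-the-envelope calculation of where that factor comes from,
assuming about 6 columns per logical row and ~25-byte values; the
per-cell bookkeeping constant is approximate.)

    public class RowKeyOverhead {
      public static void main(String[] args) {
        // Illustrative numbers only.
        int rowKeyLen = 100;  // our 100-byte rowkey
        int columns   = 6;    // assumed columns per logical row
        int valueLen  = 25;   // assumed average value size in bytes
        // Every cell (KeyValue) stores the full rowkey again, plus
        // roughly 25 bytes of family/qualifier/timestamp bookkeeping.
        int storedPerRow  = columns * (rowKeyLen + 25 + valueLen);  // ~900 bytes
        int payloadPerRow = columns * valueLen;                     // 150 bytes
        System.out.println("blow-up: "
            + (double) storedPerRow / payloadPerRow + "x");         // ~6x
      }
    }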
> >
> > Serializing the row into a single HBase column kept the disk
> > usage under control. Another approach I tried was to club a number of
> > rows into a single HBase row and use a different indexing scheme
> > with a simple long rowkey. This provided the best performance and
> > used the least amount of disk space.
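
(A sketch of the "club several rows into one HBase row" variant, with a
made-up table "records", family "d", and a stand-in for the real batch
serialization:)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedRowWriter {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "records");
        long batchId = 42L;                // simple long rowkey, one per batch
        byte[] batch = serializeBatch();   // stand-in for the real serialization
        Put put = new Put(Bytes.toBytes(batchId));
        // One column holds the whole serialized batch, so the 100-byte
        // key is no longer repeated for every logical record.
        put.add(Bytes.toBytes("d"), Bytes.toBytes("batch"), batch);
        table.put(put);
        table.close();
      }

      private static byte[] serializeBatch() {
        return new byte[0];                // placeholder
      }
    }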
> >
> > Our data is immutable, at least as far as I can foresee. Is the
> > serialized row the best option I have? Does the number of rows in a
> > table affect read performance? If so, clubbing
> > rows seems to be a reasonable option.
> >
> > Thanks and regards,
> >  - Ashish
> >
> >
> > On Wed, 19 Jan 2011 22:16:33 -0800
> > Stack <stack@duboce.net> wrote:
> >
> > > On Wed, Jan 19, 2011 at 9:50 PM, Ashish Shinde
> > > <ashish@strandls.com> wrote:
> > > > I have to say I am mighty impressed with Hadoop and HBase, the
> > > > overall philosophy and the architecture, and have decided to
> > > > contribute as much as time permits. Already looking at the
> > > > "noob" issues on the HBase JIRA :)
> > > >
> > >
> > > I'd say work on your particular need rather than on noob issues.
> > > That's probably the best contribution you could make.  Figure out the
> > > blockers -- we'll help out -- that get in the way of your sizeable
> > > incremental bulk uploads.  Your use case makes for a good story.
> > >
> > > Good luck Ashish,
> > > St.Ack
> >
> >

