hbase-user mailing list archives

From Ulrich Staudinger <ustaudin...@activequant.org>
Subject Re: Unexpected Data insertion time and Data size explosion
Date Mon, 05 Dec 2011 15:13:53 GMT
The point I refer to is not so much when HBase's server side flushes, but
when the client side flushes.
If you put every value immediately, every put results in a separate RPC
call. If you collect the data on the client side and flush manually, you
get one RPC call carrying hundreds or thousands of small puts, instead of
hundreds or thousands of individual put RPC calls.
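
As a rough illustration of that client-side batching, here is a minimal sketch against the 0.90-era HTable client API; the table name, column family, qualifier and batch size are made up for the example:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPutExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "mytable");
        // Disable auto-flush so puts are buffered on the client side
        // instead of triggering one RPC per Put.
        table.setAutoFlush(false);

        List<Put> batch = new ArrayList<Put>();
        for (int i = 0; i < 100000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                    Bytes.toBytes("value-" + i));
            batch.add(put);

            // Ship the accumulated puts together every 1000 rows
            // instead of one RPC per row.
            if (batch.size() == 1000) {
                table.put(batch);      // buffered, since autoFlush is off
                table.flushCommits();  // push the whole batch to the servers
                batch.clear();
            }
        }
        // Flush whatever is left in the buffer.
        if (!batch.isEmpty()) {
            table.put(batch);
        }
        table.flushCommits();
        table.close();
    }
}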

Another issue: I am not so sure what happens if you collect hundreds of
thousands of small puts, which together might be bigger than the memstore,
and only flush then. I guess the HBase client will hang.
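For reference, the amount buffered on the client side is bounded by hbase.client.write.buffer and can also be set per table. A minimal sketch of tuning it, again assuming the 0.90-era API; the table name and buffer value are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class WriteBufferConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Default for tables created from this configuration
        // (2MB was the usual default; 8MB here is just an example).
        conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024);

        HTable table = new HTable(conf, "mytable");
        table.setAutoFlush(false);
        // Per-table override: once the buffered puts exceed this size,
        // the client sends them to the regionservers.
        table.setWriteBufferSize(8 * 1024 * 1024);

        // ... add puts as in the previous sketch ...

        table.flushCommits();
        table.close();
    }
}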




On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com> wrote:

> Doesn't the configuration setting "hbase.hregion.memstore.flush.size" handle
> the bulk insert? I was of the opinion that HBase would flush all the puts to
> disk when its memstore is filled, the threshold being defined in
> hbase-default.xml. Is my understanding wrong here?
>
>
>
> On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <ustaudinger@activequant.org> wrote:
>
> > Hi there,
> >
> > While I cannot give you any concrete advice on your particular storage
> > problem, I can share some experiences with you regarding performance.
> >
> > I also bulk import data regularly, around 4GB every day in about 150
> > files with between 10,000 and 30,000 lines each.
> >
> > My first approach was to read every line and put it separately, which
> > resulted in a load time of about an hour. My next approach was to read
> > an entire file, add each individual put to a list and then store the
> > entire list at once. This worked fast in the beginning, but after about
> > 20 files the server ran into compactions and could not cope with the
> > load; finally the master crashed, leaving the regionserver and ZooKeeper
> > running. In HBase's defense, I have to say that I did this on a
> > standalone installation without Hadoop underneath, so the test may not
> > be entirely fair.
> > Next, I switched to a proper Hadoop layer with HBase on top. I now also
> > put around 100-1000 lines (or puts) at once, in a bulk commit, and have
> > insert times of around 0.5ms per row, which is very decent. My entire
> > import now takes only 7 minutes.
> >
> > I think you must find a balance between the performance of your servers,
> > how quickly they handle compactions, and the amount of data you put at
> > once. I have definitely found single puts to result in low performance.
> >
> > Best regards,
> > Ulrich
> >
> >
> >
> >
> >
> > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
> >
> > > No, I split the table on the fly. I did this because converting my
> > > table into HBase format (rowID, family, qualifier, value) would result
> > > in the input file being around 300GB. Hence, I decided to do the
> > > splitting and generate this format on the fly.
> > >
> > > Will this affect the performance so heavily?
> > >
> > > On Mon, Dec 5, 2011 at 1:21 AM, <yuzhihong@gmail.com> wrote:
> > >
> > > > May I ask whether you pre-split your table before loading?
> > > >
> > > >
> > > >
> > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > >    I am a newbie to HBase and Hadoop. I have set up a cluster of 4
> > > > > machines and am trying to insert data. 3 of the machines are
> > > > > tasktrackers, with 4 map tasks each.
> > > > >
> > > > >    My data consists of about 1.3 billion rows with 4 columns each
> > > > > (100GB txt file). The column structure is "rowID, word1, word2,
> > > > > word3". My DFS replication in Hadoop and HBase is set to 3 each. I
> > > > > have used only one column family, with 3 qualifiers, one for each
> > > > > field (word*).
> > > > >
> > > > >    I am using the SampleUploader present in the HBase distribution.
> > > > > To complete 40% of the insertion it has taken around 21 hrs, and it
> > > > > is still running. I have 12 map tasks running. *Is the insertion
> > > > > time taken here along expected lines? When I used Lucene, I was
> > > > > able to insert the entire data set in about 8 hours.*
> > > > >
> > > > >    Also, there seems to be a huge explosion of data size here. With
> > > > > a replication factor of 3 for HBase, I was expecting the inserted
> > > > > table size to be around 350-400GB (350-400GB for the 100GB txt file
> > > > > I have: 300GB for replicating the data 3 times and 50+ GB for
> > > > > additional storage information). But even at 40% completion of the
> > > > > insertion, the space occupied is around 550GB (it looks like it
> > > > > might take around 1.2TB for a 100GB file). *I have used a String
> > > > > rowID instead of a Long. Would that account for such a rapid
> > > > > increase in data storage?*
> > > > >
> > > > > Regards,
> > > > > Kranthi
> > > >
> > >
> > >
> > >
> > > --
> > > Kranthi Reddy. B
> > >
> > > http://www.setusoftware.com/setu/index.htm
> > >
> >
>
>
>
> --
> Kranthi Reddy. B
>
> http://www.setusoftware.com/setu/index.htm
>
