hbase-user mailing list archives

From kranthi reddy <kranthili2...@gmail.com>
Subject Re: Unexpected Data insertion time and Data size explosion
Date Mon, 19 Dec 2011 05:54:26 GMT
Hi all,

     I have now understood why my storage is occupying such a huge amount of
space.

     I still have an issue with the insertion time. I currently have 0.1
billion records (in HBase format; in the future it will run into a few
billion) and am inserting them using 12 map tasks running on a 4-machine
Hadoop cluster.

     The time taken is approximately 3 hours, which works out to roughly 770
row insertions per map task per second. Is this good, or can it be improved?

      0.1 billion: 100,000,000 / (180 min * 60 s * 12 map tasks) ≈ 770 rows
per map task per second.

 I have tried using the batch() function, but there was no improvement in the
insertion time.
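
For reference, here is a minimal sketch of client-side write buffering with
the old HTable client (the 0.90/0.92-era API); the table name, family,
qualifier, values and buffer size below are illustrative assumptions, not the
actual schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedInsert {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");    // illustrative table name
            table.setAutoFlush(false);                     // buffer puts on the client side
            table.setWriteBufferSize(8 * 1024 * 1024);     // e.g. 8 MB; tune as needed

            for (int i = 0; i < 100000; i++) {             // stand-in for reading input rows
                Put put = new Put(Bytes.toBytes(Integer.toString(i)));
                put.add(Bytes.toBytes("f"), Bytes.toBytes("c"),
                        Bytes.toBytes("value-" + i));
                table.put(put);                            // queued in the write buffer, not sent yet
            }
            table.flushCommits();                          // remaining buffered puts go out in one batch
            table.close();
        }
    }

With autoFlush off, table.put() only queues the edits; the client sends them
when the write buffer fills or when flushCommits() is called, so each RPC
carries many puts rather than one.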

 I have attached the code that I am using for the inserts. Could someone
please check whether what I am doing is the fastest and best way to insert
the data?
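
As an aside, for very large one-time loads another commonly suggested route
(assuming the MapReduce classes that ship with 0.90/0.92) is to have the job
write HFiles directly and then bulk-load them into the table, bypassing the
put/memstore path entirely. A rough sketch of such a driver follows; the class
names, table name, paths and the tab-separated parsing are illustrative
assumptions:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

        // Parses "rowID<TAB>word1<TAB>word2<TAB>word3" and emits one Put per line.
        public static class PutMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split("\t");
                Put put = new Put(Bytes.toBytes(f[0]));
                put.add(Bytes.toBytes("f"), Bytes.toBytes("w1"), Bytes.toBytes(f[1]));
                put.add(Bytes.toBytes("f"), Bytes.toBytes("w2"), Bytes.toBytes(f[2]));
                put.add(Bytes.toBytes("f"), Bytes.toBytes("w3"), Bytes.toBytes(f[3]));
                ctx.write(new ImmutableBytesWritable(put.getRow()), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "prepare-hfiles");
            job.setJarByClass(BulkLoadDriver.class);
            job.setMapperClass(PutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path("/input/rows"));    // illustrative path
            FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));  // illustrative path

            // Wires up the reducer, partitioner and HFileOutputFormat so the
            // output matches the table's current region boundaries.
            HTable table = new HTable(conf, "mytable");                    // illustrative table name
            HFileOutputFormat.configureIncrementalLoad(job, table);

            if (job.waitForCompletion(true)) {
                // Moves the generated HFiles into the regions; no per-row RPCs.
                new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);
            }
        }
    }
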
Regards,
Kranthi



On Mon, Dec 5, 2011 at 11:12 PM, Doug Meil <doug.meil@explorysmedical.com> wrote:

>
> Hi there-
>
> Have you looked at this?
>
> http://hbase.apache.org/book.html#keyvalue
>
>
>
>
>
> On 12/5/11 11:33 AM, "kranthi reddy" <kranthili2020@gmail.com> wrote:
>
> >Ok. But can someone explain why the data size is exploding the way I have
> >mentioned earlier?
> >
> >I have tried to insert a sample data set of around 12GB. The space occupied
> >by the HBase table is around 130GB. All my columns, including the ROWID, are
> >strings. I have even tried converting my ROWID to long, but that seems to
> >occupy more space, i.e. around 150GB.
> >
> >Sample rows
> >
> >0-<>-f-<>-c-<>-Anarchism
> >0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
> >0-<>-f-<>-e2-<>-anarchy
> >1-<>-f-<>-c-<>-Anarchism
> >1-<>-f-<>-e1-<>-anarchy
> >1-<>-f-<>-e2-<>-state (polity)
> >2-<>-f-<>-c-<>-Anarchism
> >2-<>-f-<>-e1-<>-anarchy
> >2-<>-f-<>-e2-<>-political philosophy
> >3-<>-f-<>-c-<>-Anarchism
> >3-<>-f-<>-e1-<>-The Globe and Mail
> >3-<>-f-<>-e2-<>-anarchy
> >4-<>-f-<>-c-<>-Anarchism
> >4-<>-f-<>-e1-<>-anarchy
> >4-<>-f-<>-e2-<>-stateless society
> >
> >Is there a way I can find out the number of bytes occupied by each
> >key:value pair in each cell?
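
For reference, each cell physically stores the full row key, family,
qualifier, timestamp and type alongside the value, which is why long string
row keys and repeated text values inflate the table well beyond the raw input
size. A small sketch against the old client API, using one of the sample rows
above, prints the size of a single cell:

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellSize {
        public static void main(String[] args) {
            // One cell from the sample data: row "0", family "f", qualifier "c",
            // value "Anarchism".
            KeyValue kv = new KeyValue(Bytes.toBytes("0"), Bytes.toBytes("f"),
                    Bytes.toBytes("c"), System.currentTimeMillis(),
                    Bytes.toBytes("Anarchism"));
            System.out.println("serialized length = " + kv.getLength() + " bytes");
            System.out.println("heap size         = " + kv.heapSize() + " bytes");
        }
    }

The KeyValue section of the HBase book (linked in Doug's reply) describes the
same layout.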
> >
> >On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger <
> >ustaudinger@activequant.org> wrote:
> >
> >> The point I am referring to is not so much when HBase's server side
> >> flushes, but when the client side flushes. If you put every value
> >> immediately, each put results in its own RPC call. If you collect the data
> >> on the client side and flush it manually (on the client side), it results
> >> in one RPC call carrying hundreds or thousands of small puts, instead of
> >> hundreds or thousands of individual put RPC calls.
> >>
> >> Another issue is that I am not sure what happens if you collect hundreds
> >> of thousands of small puts, which together might be bigger than the
> >> memstore, and only flush then. I guess the HBase client will hang.
> >>
> >>
> >>
> >>
> >> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
> >>
> >> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
> >> > take care of bulk inserting? I was of the opinion that HBase would flush
> >> > all the puts to disk when its memstore is filled, a threshold that is
> >> > defined in hbase-default.xml. Is my understanding wrong here?
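
To make the distinction concrete: hbase.hregion.memstore.flush.size is a
server-side, per-region setting that decides when a region server flushes its
memstore to an HFile, while the buffer that collects puts into fewer RPCs
lives in the client and is controlled separately. A small sketch (the 8 MB
value is arbitrary and "mytable" is an illustrative name):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class ClientVsServerBuffers {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Client side: bytes of queued puts before one RPC is sent.
            conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024);

            // Server side: hbase.hregion.memstore.flush.size is read from the
            // region servers' hbase-site.xml; setting it here on the client has
            // no effect on how puts are batched over the wire.

            HTable table = new HTable(conf, "mytable");
            table.setAutoFlush(false);   // required for the client write buffer to apply
            // ... add puts; they are sent when the buffer fills or on flushCommits()
            table.flushCommits();
            table.close();
        }
    }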
> >> >
> >> >
> >> >
> >> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
> >> > ustaudinger@activequant.org> wrote:
> >> >
> >> > > Hi there,
> >> > >
> >> > > While I cannot give you any concrete advice on your particular storage
> >> > > problem, I can share some experience with you regarding performance.
> >> > >
> >> > > I also bulk-import data regularly: around 4GB every day, in about 150
> >> > > files with between 10,000 and 30,000 lines each.
> >> > >
> >> > > My first approach was to read every line and put it separately, which
> >> > > resulted in a load time of about an hour. My next approach was to read
> >> > > an entire file, add each individual put to a list, and then store the
> >> > > entire list at once. This worked fast in the beginning, but after about
> >> > > 20 files the server ran into compactions and couldn't cope with the
> >> > > load; finally the master crashed, leaving the regionserver and ZooKeeper
> >> > > running. In HBase's defense, I have to say that I did this on a
> >> > > standalone installation without Hadoop underneath, so the test may not
> >> > > be entirely fair.
> >> > > Next, I switched to a proper Hadoop layer with HBase on top. I now also
> >> > > put around 100-1000 lines (or puts) at once, in a bulk commit, and have
> >> > > insert times of around 0.5 ms per row, which is very decent. My entire
> >> > > import now takes only 7 minutes.
> >> > >
> >> > > I think you must find a balance between the performance of your servers
> >> > > (how quickly they keep up with compactions) and the amount of data you
> >> > > put at once. I have definitely found single puts to result in low
> >> > > performance.
> >> > >
> >> > > Best regards,
> >> > > Ulrich
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
> >> > >
> >> > > > No, I split the table on the fly. I did this because converting my
> >> > > > table into HBase format (rowID, family, qualifier, value) would result
> >> > > > in the input file being around 300GB. Hence, I decided to do the
> >> > > > splitting and generate this format on the fly.
> >> > > >
> >> > > > Will this affect the performance so heavily?
> >> > > >
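
For context, pre-splitting means creating the table with region boundaries up
front, so the load is spread across all region servers from the start instead
of hammering a single region until it splits on its own. A rough sketch with
the old admin API; the table name, family and split points are illustrative,
and real split points should be chosen from the actual key distribution:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("mytable");
            desc.addFamily(new HColumnDescriptor("f"));

            // Numeric-string row IDs sort lexicographically, so splitting on the
            // leading digit gives ten coarse initial regions.
            byte[][] splits = new byte[9][];
            for (int i = 0; i < 9; i++) {
                splits[i] = Bytes.toBytes(String.valueOf(i + 1));
            }
            admin.createTable(desc, splits);
        }
    }
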
> >> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yuzhihong@gmail.com> wrote:
> >> > > >
> >> > > > > May I ask whether you pre-split your table before loading?
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
> >> > > > >
> >> > > > > > Hi all,
> >> > > > > >
> >> > > > > >    I am a newbie to HBase and Hadoop. I have set up a cluster of 4
> >> > > > > > machines and am trying to insert data. 3 of the machines are
> >> > > > > > tasktrackers, with 4 map tasks each.
> >> > > > > >
> >> > > > > >    My data consists of about 1.3 billion rows with 4 columns each
> >> > > > > > (a 100GB txt file). The column structure is "rowID, word1, word2,
> >> > > > > > word3". My DFS replication in Hadoop and HBase is set to 3 each. I
> >> > > > > > have used only one column family, with 3 qualifiers, one for each
> >> > > > > > field (word*).
> >> > > > > >
> >> > > > > >    I am using the SampleUploader present in the HBase
> >> > > > > > distribution. To complete 40% of the insertion it has taken around
> >> > > > > > 21 hrs, and it's still running. I have 12 map tasks running. I
> >> > > > > > would like to know whether the insertion time taken here is on
> >> > > > > > expected lines, because when I used Lucene I was able to insert the
> >> > > > > > entire data set in about 8 hours.
> >> > > > > >
> >> > > > > >    Also, there seems to be a huge explosion of data size here.
> >> > > > > > With a replication factor of 3 for HBase, I was expecting the
> >> > > > > > inserted table size to be around 350-400GB for my 100GB txt file
> >> > > > > > (300GB for replicating the data 3 times and 50+ GB for additional
> >> > > > > > storage information). But even at 40% completion of the insertion,
> >> > > > > > the space occupied is around 550GB (it looks like it might take
> >> > > > > > around 1.2TB for a 100GB file). I have used a String rowID instead
> >> > > > > > of a Long. Would that account for such a rapid increase in data
> >> > > > > > storage?
> >> > > > > >
> >> > > > > > Regards,
> >> > > > > > Kranthi
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Kranthi Reddy. B
> >> > > >
> >> > > > http://www.setusoftware.com/setu/index.htm
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Kranthi Reddy. B
> >> >
> >> > http://www.setusoftware.com/setu/index.htm
> >> >
> >>
> >
> >
> >
> >--
> >Kranthi Reddy. B
> >
> >http://www.setusoftware.com/setu/index.htm
>
>
>


-- 
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm
