hbase-user mailing list archives

From Doug Meil <doug.m...@explorysmedical.com>
Subject Re: Unexpected Data insertion time and Data size explosion
Date Mon, 05 Dec 2011 17:42:39 GMT

Hi there-

Have you looked at this?

http://hbase.apache.org/book.html#keyvalue
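
That section explains where the extra bytes come from: every cell is stored as a complete KeyValue, carrying the full row key, column family, qualifier, timestamp and type alongside the value, and all of that is multiplied again by HDFS replication. As a rough back-of-the-envelope sketch (the class and method names below are illustrative only, not HBase API), the uncompressed size of a single cell works out to roughly:

    // Approximate serialized size of one cell, following the KeyValue layout:
    //   <4B key length><4B value length><key><value>
    //   key = <2B row length><row><1B family length><family><qualifier>
    //         <8B timestamp><1B key type>
    public final class KeyValueSizeEstimate {

        public static long estimate(byte[] row, byte[] family,
                                    byte[] qualifier, byte[] value) {
            long keyLength = 2 + row.length + 1 + family.length
                    + qualifier.length + 8 + 1;
            return 4 + 4 + keyLength + value.length;
        }

        public static void main(String[] args) {
            // e.g. a cell like row "1234567", family "f", qualifier "e1",
            // value "anarchy" from the sample rows quoted below
            long bytes = estimate("1234567".getBytes(), "f".getBytes(),
                    "e1".getBytes(), "anarchy".getBytes());
            System.out.println("Approx. uncompressed cell size: " + bytes + " bytes");
        }
    }

For a 7-byte value the fixed key overhead is already several times the value itself, which is why small string cells can easily turn 12GB of input into well over 100GB before compression.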





On 12/5/11 11:33 AM, "kranthi reddy" <kranthili2020@gmail.com> wrote:

>Ok. But can someone explain why the data size is exploding the way I have
>mentioned earlier.
>
>I have tried to insert sample data of around 12GB. The data occupied by the
>HBase table is around 130GB. All my columns, including the ROWID, are
>strings. I have even tried converting the ROWID to long, but that seems to
>occupy more space, around 150GB.
>
>Sample rows
>
>0-<>-f-<>-c-<>-Anarchism
>0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
>0-<>-f-<>-e2-<>-anarchy
>1-<>-f-<>-c-<>-Anarchism
>1-<>-f-<>-e1-<>-anarchy
>1-<>-f-<>-e2-<>-state (polity)
>2-<>-f-<>-c-<>-Anarchism
>2-<>-f-<>-e1-<>-anarchy
>2-<>-f-<>-e2-<>-political philosophy
>3-<>-f-<>-c-<>-Anarchism
>3-<>-f-<>-e1-<>-The Globe and Mail
>3-<>-f-<>-e2-<>-anarchy
>4-<>-f-<>-c-<>-Anarchism
>4-<>-f-<>-e1-<>-anarchy
>4-<>-f-<>-e2-<>-stateless society
>
>Is there a way I can know the number of bytes occupied by each key:value
>for each cell?
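
A minimal sketch of one way to check this with the client API of that era (the table name and the number of sampled rows are placeholders): scan a few rows and print the serialized length of each KeyValue.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellSizeInspector {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");   // placeholder table name
            Scan scan = new Scan();
            scan.setCaching(100);                         // fetch rows in batches
            ResultScanner scanner = table.getScanner(scan);
            int rows = 0;
            for (Result result : scanner) {
                for (KeyValue kv : result.raw()) {
                    // getLength() is the full serialized KeyValue:
                    // row + family + qualifier + timestamp + type + value + framing
                    System.out.println(Bytes.toString(kv.getRow()) + ":"
                            + Bytes.toString(kv.getQualifier())
                            + " -> " + kv.getLength() + " bytes");
                }
                if (++rows >= 10) break;                  // only sample a few rows
            }
            scanner.close();
            table.close();
        }
    }
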
>
>On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger <
>ustaudinger@activequant.org> wrote:
>
>> The point I refer to is not so much about when HBase's server side
>> flushes, but when the client side flushes. If you put every value
>> immediately, every put results in an RPC call. If you collect the data on
>> the client side and flush manually, it results in one RPC call with
>> hundreds or thousands of small puts inside, instead of hundreds or
>> thousands of individual put RPC calls.
>>
>> Another issue: I am not so sure what happens if you collect hundreds of
>> thousands of small puts, which together might be bigger than the memstore,
>> and flush then. I guess the HBase client will hang.
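
A minimal sketch of that client-side batching with the HTable API of the time (table, family and values are placeholders; buffer and batch sizes are arbitrary examples):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedLoader {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");
            table.setAutoFlush(false);                    // buffer puts on the client
            table.setWriteBufferSize(4 * 1024 * 1024);    // e.g. a 4MB client-side buffer

            List<Put> batch = new ArrayList<Put>();
            for (int i = 0; i < 1000; i++) {              // pretend these come from an input file
                Put put = new Put(Bytes.toBytes(Integer.toString(i)));
                put.add(Bytes.toBytes("f"), Bytes.toBytes("c"),
                        Bytes.toBytes("value-" + i));
                batch.add(put);
                if (batch.size() == 500) {                // hand over puts in modest chunks
                    table.put(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                table.put(batch);
            }
            table.flushCommits();                         // send whatever is still buffered
            table.close();
        }
    }
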
>>
>>
>>
>>
>> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
>>
>> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
>> > handle the bulk insert? I was of the opinion that HBase would flush all
>> > the puts to disk when its memstore is filled, the threshold for which is
>> > defined in hbase-default.xml. Is my understanding wrong here?
>> >
>> >
>> >
>> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
>> > ustaudinger@activequant.org> wrote:
>> >
>> > > Hi there,
>> > >
>> > > While I cannot give you any concrete advice on your particular storage
>> > > problem, I can share some experience regarding performance.
>> > >
>> > > I also bulk import data regularly, around 4GB every day in about 150
>> > > files with between 10,000 and 30,000 lines each.
>> > >
>> > > My first approach was to read every line and put it separately, which
>> > > resulted in a load time of about an hour. My next approach was to read
>> > > an entire file, add each individual put to a list and then store the
>> > > entire list at once. This worked fast in the beginning, but after about
>> > > 20 files the server ran into compactions, couldn't cope with the load,
>> > > and finally the master crashed, leaving the regionserver and ZooKeeper
>> > > running. In HBase's defense, I have to say that I did this on a
>> > > standalone installation without Hadoop underneath, so the test may not
>> > > be entirely fair.
>> > > Next, I switched to a proper Hadoop layer with HBase on top. I now also
>> > > put around 100 - 1000 lines (or puts) at once, in a bulk commit, and
>> > > have insert times of around 0.5ms per row, which is very decent. My
>> > > entire import now takes only 7 minutes.
>> > >
>> > > I think you must find a balance between the performance of your servers
>> > > (how quickly they get through compactions) and the amount of data you
>> > > put at once. I have definitely found single puts to result in low
>> > > performance.
>> > >
>> > > Best regards,
>> > > Ulrich
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
>> > >
>> > > > No, I split the table on the fly. I did this because converting my
>> > > > table into the HBase format (rowID, family, qualifier, value) would
>> > > > result in the input file being around 300GB. Hence, I decided to do
>> > > > the splitting and generate this format on the fly.
>> > > >
>> > > > Will this affect the performance so heavily?
>> > > >
>> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yuzhihong@gmail.com> wrote:
>> > > >
>> > > > > May I ask whether you pre-split your table before loading ?
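
For reference, a minimal sketch of what pre-splitting at table-creation time can look like with the admin API of that era (the table name, family and split points are placeholder examples for a numeric-string key space):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("mytable");
            desc.addFamily(new HColumnDescriptor("f"));

            // Split points spread over the expected row-key range so the
            // bulk load is distributed across several regions from the start.
            byte[][] splits = new byte[][] {
                    Bytes.toBytes("2"), Bytes.toBytes("4"),
                    Bytes.toBytes("6"), Bytes.toBytes("8")
            };
            admin.createTable(desc, splits);
        }
    }
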
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > >    I am a newbie to HBase and Hadoop. I have set up a cluster of
>> > > > > > 4 machines and am trying to insert data. 3 of the machines are
>> > > > > > tasktrackers, with 4 map tasks each.
>> > > > > >
>> > > > > >    My data consists of about 1.3 billion rows with 4 columns each
>> > > > > > (a 100GB txt file). The column structure is "rowID, word1, word2,
>> > > > > > word3". My DFS replication in Hadoop and HBase is set to 3 each.
>> > > > > > I have used only one column family and 3 qualifiers, one for each
>> > > > > > field (word*).
>> > > > > >
>> > > > > >    I am using the SampleUploader present in the HBase
>> > > > > > distribution. To complete 40% of the insertion it has taken
>> > > > > > around 21 hrs, and it's still running. I have 12 map tasks
>> > > > > > running. *Is the insertion time taken here on expected lines?
>> > > > > > When I used Lucene, I was able to insert the entire data in about
>> > > > > > 8 hours.*
>> > > > > >
>> > > > > >    Also, there seems to be a huge explosion of data size here.
>> > > > > > With a replication factor of 3 for HBase, I was expecting the
>> > > > > > inserted table size to be around 350-400GB for the 100GB txt file
>> > > > > > I have: 300GB for replicating the data 3 times and 50+ GB for
>> > > > > > additional storage information. But even at 40% completion of the
>> > > > > > data insertion, the space occupied is around 550GB (it looks like
>> > > > > > it might take around 1.2TB for a 100GB file). *I have used a
>> > > > > > String rowID instead of a Long. Will that account for such a
>> > > > > > rapid increase in data storage?*
>> > > > > >
>> > > > > > Regards,
>> > > > > > Kranthi
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Kranthi Reddy. B
>> > > >
>> > > > http://www.setusoftware.com/setu/index.htm
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Kranthi Reddy. B
>> >
>> > http://www.setusoftware.com/setu/index.htm
>> >
>>
>
>
>
>-- 
>Kranthi Reddy. B
>
>http://www.setusoftware.com/setu/index.htm


