hbase-user mailing list archives

From kranthi reddy <kranthili2...@gmail.com>
Subject Re: Unexpected Data insertion time and Data size explosion
Date Mon, 05 Dec 2011 17:26:06 GMT
1) Does having a dfs.replication factor of "3" in general result in a table
data size of 3x + y (where x is the size of the file on the local file system
and y is some additional space for meta information storage) ???

2) Does HBase pre-allocate space for all the cell versions when the cell is
created for the first time?

Unfortunately, I am just unable to wrap my head around such a huge increase in
data size. Unless that is what is happening (which I doubt), I don't see how
such growth of the table data is possible.

3) Or is it the case that my KEY is larger than my VALUE, and hence results in
such a large size increase ???

*Similar to the sample rows below, I have around 300 million entries and the
ROWID increases linearly.*
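
For what it's worth, I understand HBase stores the full coordinates (row key,
column family, qualifier, timestamp) alongside every value, so short values
with comparatively long keys carry a lot of per-cell overhead. Here is a rough
sketch of how one might check this, assuming the 0.90-era KeyValue client API
(untested, just to illustrate the idea):

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class CellSizeEstimate {
  public static void main(String[] args) {
    // One cell from the sample rows below: row "0", family "f",
    // qualifier "c", value "Anarchism".
    KeyValue kv = new KeyValue(
        Bytes.toBytes("0"),          // row key
        Bytes.toBytes("f"),          // column family
        Bytes.toBytes("c"),          // qualifier
        System.currentTimeMillis(),  // timestamp, stored with every cell
        Bytes.toBytes("Anarchism")); // value

    // getLength() is the serialized size of the whole cell:
    // key/value length fields + row + family + qualifier + timestamp
    // + type byte + value.
    System.out.println("value bytes: " + Bytes.toBytes("Anarchism").length);
    System.out.println("cell bytes:  " + kv.getLength());
  }
}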

On Mon, Dec 5, 2011 at 10:03 PM, kranthi reddy <kranthili2020@gmail.com> wrote:

> Ok. But can someone explain why the data size is exploding the way I have
> mentioned earlier?
>
> I have tried to insert a sample of around 12GB of data. The space occupied
> by the HBase table is around 130GB. All my columns, i.e. including the
> ROWID, are strings. I have even tried converting my ROWID to long, but that
> seems to occupy more space, i.e. around 150GB.
>
> Sample rows
>
> 0-<>-f-<>-c-<>-Anarchism
> 0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
> 0-<>-f-<>-e2-<>-anarchy
> 1-<>-f-<>-c-<>-Anarchism
> 1-<>-f-<>-e1-<>-anarchy
> 1-<>-f-<>-e2-<>-state (polity)
> 2-<>-f-<>-c-<>-Anarchism
> 2-<>-f-<>-e1-<>-anarchy
> 2-<>-f-<>-e2-<>-political philosophy
> 3-<>-f-<>-c-<>-Anarchism
> 3-<>-f-<>-e1-<>-The Globe and Mail
> 3-<>-f-<>-e2-<>-anarchy
> 4-<>-f-<>-c-<>-Anarchism
> 4-<>-f-<>-e1-<>-anarchy
> 4-<>-f-<>-e2-<>-stateless society
>
> Is there a way I can know the number of bytes occupied by each key:value
> for each cell ???
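
(Perhaps something like the following would answer this question - a rough
sketch, assuming the 0.90-era Get/Result API; "mytable" is a placeholder, not
my actual table name:)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellBytes {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");  // placeholder table name
    try {
      // Fetch one row and print the serialized size of each of its cells.
      Result r = table.get(new Get(Bytes.toBytes("0")));
      for (KeyValue kv : r.raw()) {              // one KeyValue per cell
        System.out.println(Bytes.toString(kv.getQualifier()) + " -> "
            + kv.getLength() + " bytes");
      }
    } finally {
      table.close();
    }
  }
}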
>
>
> On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger <
> ustaudinger@activequant.org> wrote:
>
>> The point I refer to is not so much about when HBase's server side
>> flushes, but about when the client side flushes.
>> If you put every value immediately, every put results in an RPC call. If
>> you collect the data on the client side and flush manually (on the client
>> side), it results in one RPC call with hundreds or thousands of small
>> puts inside, instead of hundreds or thousands of individual put RPC calls.
>>
>> Another issue is that I am not sure what happens if you collect hundreds
>> of thousands of small puts, which together might be bigger than the
>> memstore, and then flush. I guess the HBase client will hang.
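
(For reference, a rough sketch of the client-side batching Ulrich describes
above, assuming the 0.90-era client API; the table name, column names and
loop are placeholders, not my actual schema or loader:)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedLoad {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // placeholder table name
    table.setAutoFlush(false);                    // don't send one RPC per put
    table.setWriteBufferSize(8 * 1024 * 1024);    // flush roughly every 8MB

    for (long i = 0; i < 1000000; i++) {          // stand-in for reading the input file
      Put p = new Put(Bytes.toBytes(Long.toString(i)));
      p.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("value-" + i));
      table.put(p);         // buffered on the client until the buffer fills
    }
    table.flushCommits();   // send whatever is still sitting in the buffer
    table.close();
  }
}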
>>
>>
>>
>>
>> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
>>
>> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
>> > take care of the bulk insert ??? I was of the opinion that HBase would
>> > flush all the puts to disk when its memstore is filled, the size of
>> > which is defined in hbase-default.xml. Is my understanding wrong here ???
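
(If I am reading Ulrich's point correctly, hbase.hregion.memstore.flush.size
is a server-side, per-region setting; the client-side buffer is a separate
knob - hbase.client.write.buffer, or setWriteBufferSize on the HTable. A
hypothetical way to set it from the client, assuming that property name:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientBufferConf {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Default write buffer size (bytes) for HTables created from this conf;
    // it only matters once auto-flush is turned off on the table.
    conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024L);
    System.out.println(conf.getLong("hbase.client.write.buffer", -1));
  }
}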
>> >
>> >
>> >
>> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
>> > ustaudinger@activequant.org> wrote:
>> >
>> > > Hi there,
>> > >
>> > > while I cannot give you any concrete advice on your particular storage
>> > > problem, I can share some experiences with you regarding performance.
>> > >
>> > > I also bulk import data regularly, around 4GB every day in about 150
>> > > files with something between 10'000 and 30'000 lines each.
>> > >
>> > > My first approach was to read every line and put it separately, which
>> > > resulted in a load time of about an hour. My next approach was to read
>> > > an entire file, add each individual put to a list and then store the
>> > > entire list at once. This works fast in the beginning, but after about
>> > > 20 files the server ran into compactions, couldn't cope with the load,
>> > > and finally the master crashed, leaving the regionserver and zookeeper
>> > > running. To HBase's defense, I have to say that I did this on a
>> > > standalone installation without Hadoop underneath, so the test may not
>> > > be entirely fair.
>> > > Next, I switched to a proper Hadoop layer with HBase on top. I now also
>> > > put around 100 - 1000 lines (or puts) at once, in a bulk commit, and
>> > > have insert times of around 0.5ms per row - which is very decent. My
>> > > entire import now takes only 7 minutes.
>> > >
>> > > I think you must find a balance between the performance of your servers
>> > > - how quickly they cope with compactions - and the amount of data you
>> > > put at once. I have definitely found single puts to result in low
>> > > performance.
>> > >
>> > > Best regards,
>> > > Ulrich
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
>> > >
>> > > > No, I split the table on the fly. This I have done because converting
>> > > > my table into HBase format (rowID, family, qualifier, value) would
>> > > > result in the input file being around 300GB. Hence, I had decided to
>> > > > do the splitting and generate this format on the fly.
>> > > >
>> > > > Will this affect the performance so heavily ???
>> > > >
>> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yuzhihong@gmail.com> wrote:
>> > > >
>> > > > > May I ask whether you pre-split your table before loading ?
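
(I gather "pre-split" here means creating the table with region split points
up front, so the bulk load spreads across regions from the start instead of
hammering a single region. A rough sketch, assuming the 0.90-era HBaseAdmin
API, with a placeholder table name and made-up split points:)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable"); // placeholder name
    desc.addFamily(new HColumnDescriptor("f"));

    // Made-up split points; with string row IDs they should follow the
    // lexicographic distribution of the keys, not their numeric order.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3"),
        Bytes.toBytes("4"), Bytes.toBytes("5"), Bytes.toBytes("6"),
        Bytes.toBytes("7"), Bytes.toBytes("8"), Bytes.toBytes("9")
    };
    admin.createTable(desc, splits); // table starts with splits.length + 1 regions
  }
}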
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kranthili2020@gmail.com> wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > >    I am a newbie to HBase and Hadoop. I have set up a cluster of 4
>> > > > > > machines and am trying to insert data. 3 of the machines are
>> > > > > > tasktrackers, with 4 map tasks each.
>> > > > > >
>> > > > > >    My data consists of about 1.3 billion rows with 4 columns each
>> > > > > > (100GB txt file). The column structure is "rowID, word1, word2,
>> > > > > > word3". My DFS replication in hadoop and hbase is set to 3 each. I
>> > > > > > have put only one column family and 3 qualifiers, one for each
>> > > > > > field (word*).
>> > > > > >
>> > > > > >    I am using the SampleUploader present in the HBase distribution.
>> > > > > > To complete 40% of the insertion, it has taken around 21 hrs and
>> > > > > > it's still running. I have 12 map tasks running.* I would like to
>> > > > > > know whether the insertion time taken here is on expected lines ???
>> > > > > > Because when I used lucene, I was able to insert the entire data in
>> > > > > > about 8 hours.*
>> > > > > >
>> > > > > >    Also, there seems to be a huge explosion of data size here. With
>> > > > > > a replication factor of 3 for HBase, I was expecting the inserted
>> > > > > > table size to be around 350-400GB (350-400GB for the 100GB txt file
>> > > > > > I have: 300GB for replicating the data 3 times and 50+ GB for
>> > > > > > additional storage information). But even at 40% completion of the
>> > > > > > data insertion, the space occupied is around 550GB (it looks like
>> > > > > > it might take around 1.2TB for a 100GB file).* I have used the
>> > > > > > rowID as a String, instead of a Long. Will that account for such a
>> > > > > > rapid increase in data storage???*
>> > > > > >
>> > > > > > Regards,
>> > > > > > Kranthi
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Kranthi Reddy. B
>> > > >
>> > > > http://www.setusoftware.com/setu/index.htm
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Kranthi Reddy. B
>> >
>> > http://www.setusoftware.com/setu/index.htm
>> >
>>
>
>
>
> --
> Kranthi Reddy. B
>
> http://www.setusoftware.com/setu/index.htm
>



-- 
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm
