Subject: Re: Unexpected Data insertion time and Data size explosion
From: kranthi reddy <kranthili2020@gmail.com>
To: yuzhihong@gmail.com
Cc: user@hbase.apache.org
Date: Mon, 5 Dec 2011 10:53:16 +0530

No, I split the table on the fly. I did this because converting my table
into the HBase format (rowID, family, qualifier, value) would make the
input file around 300GB. Hence I decided to do the splitting and generate
this format on the fly. Will this affect the performance so heavily?
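For reference, a minimal sketch of what pre-splitting at table-creation
time could look like with the Java client API of that era (the table name
"words", the family "w" and the split points are made-up placeholders,
not details taken from this thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table and family names, for illustration only.
        HTableDescriptor desc = new HTableDescriptor("words");
        desc.addFamily(new HColumnDescriptor("w"));

        // With decimal-string rowIDs, splitting on the leading digit gives
        // a handful of (unevenly sized) regions; real split points should
        // follow the actual key distribution.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("2"), Bytes.toBytes("4"),
            Bytes.toBytes("6"), Bytes.toBytes("8")
        };
        admin.createTable(desc, splits);
      }
    }

The point of pre-splitting is simply that every region server takes writes
from the first map task onward, instead of all the mappers funnelling into
the single region a freshly created table starts with and waiting for it
to split on the fly.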
On Mon, Dec 5, 2011 at 1:21 AM, <yuzhihong@gmail.com> wrote:

> May I ask whether you pre-split your table before loading?
>
> On Dec 4, 2011, at 6:19 AM, kranthi reddy wrote:
>
> > Hi all,
> >
> > I am a newbie to HBase and Hadoop. I have set up a cluster of 4
> > machines and am trying to insert data. 3 of the machines are
> > tasktrackers, with 4 map tasks each.
> >
> > My data consists of about 1.3 billion rows with 4 columns each (a
> > 100GB txt file). The column structure is "rowID, word1, word2, word3".
> > My DFS replication in Hadoop and HBase is set to 3. I have only one
> > column family and 3 qualifiers, one for each field (word*).
> >
> > I am using the SampleUploader present in the HBase distribution. To
> > complete 40% of the insertion it has taken around 21 hrs, and it is
> > still running. I have 12 map tasks running. Is the insertion time
> > taken here on expected lines? When I used Lucene, I was able to insert
> > the entire data in about 8 hours.
> >
> > Also, there seems to be a huge explosion of data size here. With a
> > replication factor of 3 for HBase, I was expecting the inserted table
> > size to be around 350-400GB for the 100GB txt file I have (300GB for
> > replicating the data 3 times and 50+ GB for additional storage
> > information). But even at 40% completion of the insertion, the space
> > occupied is around 550GB (it looks like it might take around 1.2TB for
> > a 100GB file). I used a String for the rowID instead of a Long. Will
> > that account for such a rapid increase in data storage?
> >
> > Regards,
> > Kranthi

--
Kranthi Reddy. B
http://www.setusoftware.com/setu/index.htm
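A rough back-of-the-envelope on the size question (every per-field byte
count below is an assumed placeholder, not a measured value): each HBase
cell stores the row key, column family, qualifier, timestamp and type
alongside the value, so with three qualifiers the row key and roughly 20
bytes of fixed per-KeyValue overhead are written three times per row. In
Java terms the arithmetic might look like this:

    public class HBaseSizeEstimate {
      public static void main(String[] args) {
        long rows = 1300000000L;  // ~1.3 billion rows
        int cellsPerRow = 3;      // word1, word2, word3
        // Assumed per-cell sizes -- placeholders, not measured values.
        int fixedOverhead = 20;   // key/value lengths, row length, family
                                  // length, timestamp and type bytes
        int rowKeyBytes = 10;     // rowID stored as a decimal String
        int familyBytes = 1;      // a 1-character column family name
        int qualifierBytes = 5;   // e.g. "word1"
        int valueBytes = 8;       // an average word

        long perCell = fixedOverhead + rowKeyBytes + familyBytes
                     + qualifierBytes + valueBytes;      // 44 bytes
        long oneCopy = rows * cellsPerRow * perCell;     // ~172 GB
        long replicated = oneCopy * 3;                   // ~515 GB on HDFS

        System.out.printf("one copy ~%.0f GB, x3 replication ~%.0f GB%n",
            oneCopy / 1e9, replicated / 1e9);
      }
    }

Under these assumptions a single copy already comes to roughly 170GB
before HFile indexes, the WAL and not-yet-compacted store files are
counted, and about three times that after HDFS replication. A
decimal-String rowID (up to 10 bytes for 1.3 billion) versus an 8-byte
Long adds only a couple of bytes per cell, so it is a minor contributor
compared with the repeated key, family and qualifier bytes.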