Date: Thu, 16 Jan 2014 11:00:21 +0200
Subject: Re: KeyValue size in bytes compared to store files size
From: Amit Sela
To: user@hbase.apache.org

I tried the bulk load and the KeyValue size count with an uncompressed table, and it makes sense now: the count equals the store file size.
I took a look at the (uncompressed) files and they seem to be OK.

The entire bulk load is ~100GB; with GZ it ends up at 7GB.

Could such a compression ratio make sense for a table with many qualifiers per row (the average is 16, but in practice some rows have many more, and a small number of rows have hundreds of thousands)?

If each KeyValue contains the rowkey, and the rowkeys contain more bytes than the qualifiers / values, then the rowkeys repeat themselves in the HFile and actually make up most of it, right?
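To make the rowkey question concrete, here is a minimal Python sketch, not taken from the thread. It serializes cells in the classic KeyValue layout (key length, value length, row length, row, family length, family, qualifier, timestamp, key type, value), measures how much of the raw bytes are rowkey bytes, and compresses the result. The row key shape, family `f`, qualifier names, values and row counts are all hypothetical, and `zlib` is only a stand-in for the GZ codec (both use DEFLATE). Tags, HFile block/index overhead and data block encoding are ignored, so the numbers it prints are illustrative only.

```python
import struct
import zlib

# Fixed per-KeyValue overhead in the classic layout (no tags):
# 4 (key length) + 4 (value length) + 2 (row length) + 1 (family length)
# + 8 (timestamp) + 1 (key type) = 20 bytes.
FIXED_OVERHEAD = 20
KEY_TYPE_PUT = 4  # type code used for Put cells


def serialize_keyvalue(row, family, qualifier, timestamp, value):
    """Serialize one cell; note that the full row key is stored in every cell."""
    key = (
        struct.pack(">H", len(row)) + row
        + struct.pack(">B", len(family)) + family
        + qualifier
        + struct.pack(">qB", timestamp, KEY_TYPE_PUT)
    )
    return struct.pack(">ii", len(key), len(value)) + key + value


def build_sample(num_rows=1000, qualifiers_per_row=16):
    """Build synthetic cells (hypothetical shapes, NOT the table from this thread)."""
    cells = []
    row_key_bytes = 0
    for r in range(num_rows):
        row = b"example.com/some/long/path/%010d" % r
        for q in range(qualifiers_per_row):
            value = struct.pack(">q", r * qualifiers_per_row + q)
            cells.append(
                serialize_keyvalue(row, b"f", b"q%02d" % q, 1389862821000, value)
            )
            row_key_bytes += len(row)
    return b"".join(cells), row_key_bytes


if __name__ == "__main__":
    raw, row_key_bytes = build_sample()
    compressed = zlib.compress(raw)
    print("raw bytes:       ", len(raw))
    print("row key share:   ", round(row_key_bytes / len(raw), 2))
    print("compressed bytes:", len(compressed))
    print("ratio:           ", round(len(raw) / len(compressed), 1))
```

With long rowkeys and short qualifiers and values, the rowkey share of the raw bytes is large and highly repetitive, which is the kind of input DEFLATE handles well. Running the same comparison on a sample of the real data would show whether ~100GB -> 7GB is plausible for this table.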

On Wed, Jan 15, 2014 at 9:52 PM, Stack wrote:

> There can be a lot of duplication in what ends up in HFiles but 500MB ->
> 32MB does seem too good to be true.
>
> Could you try writing without GZIP or mess with the hfile reader[1] to see
> what your keys look like when at rest in an HFile (and maybe save the
> decompressed hfile to compare sizes?)
>
> St.Ack
> 1. http://hbase.apache.org/book.html#hfile
>
>
> On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela wrote:
>
> > I'm talking about the store files size and the ratio between store file
> > size and the byte count as counted in PutSortReducer.
> >
> >
> > On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu wrote:
> >
> > > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> > >
> > >
> > > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela wrote:
> > >
> > > > Hi all,
> > > > I'm trying to measure the size (in bytes) of the data I'm about to
> > > > load into HBase.
> > > > I'm using bulk load with PutSortReducer.
> > > > All bulk load data is loaded into new regions and not added to
> > > > existing ones.
> > > >
> > > > In order to count the size of all KeyValues in the Put object I
> > > > iterate over the Put's familyMap.values() and sum the KeyValue
> > > > lengths.
> > > > After loading the data, I check the region size by summing the
> > > > RegionLoad.getStorefileSizeMB().
> > > > Counting the Put objects size predicted ~500MB per region but in
> > > > practice I got ~32MB per region.
> > > > the table uses GZ compression but this cannot be the cause of such a
> > > > difference.
> > > >
> > > > Is counting the Put's KeyValues the correct way to count a row size ?
> > > > Is it comparable to the store files size ?
> > > >
> > > > Thanks,
> > > > Amit.