Subject: Re: GZ better than LZO?
From: Chris Tarnas
Date: Fri, 29 Jul 2011 09:48:45 -0700
To: user@hbase.apache.org

Your region distribution across the nodes is not great: in both cases most of your data is going to one server. Spreading the regions out across multiple servers would be best.

How many different vehicle_ids are being used, and are they all sequential integers in your tests? HBase performs better when not doing sequential inserts. You could try reversing the vehicle ids to get around that (see the many discussions on the list about using reversed timestamps as a rowkey).

Looking at your key construction, I would suggest, unless your app requires it, not left-padding your ids with zeros and instead using a delimiter between the key components. That will lead to smaller keys. If you use a tab as your delimiter, that character sorts before all other alphanumeric and punctuation characters (other than LF, CR, etc. - characters that should not be in your IDs anyway), so the keys will sort the same as left-padded ones.

I've had good luck with converting sequential numeric IDs to base 64 and then reversing them - that leads to very good key distribution across regions and shorter keys for any given number.
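Something like this is what I mean - just an untested sketch, and the alphabet here is arbitrary (any fixed 64-character set works). Emitting the low 6 bits first gives you the base-64 conversion and the reversal in one pass:

import org.apache.hadoop.hbase.util.Bytes;

public class ReversedBase64 {
    // Any fixed 64-character alphabet will do; this one is just an example.
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_";

    // Convert id to base 64, least-significant digit first. Emitting the
    // low digits first *is* the reversal, so sequential ids differ in
    // their leading character and spread across regions.
    public static byte[] reversedBase64Key(long id) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (id & 0x3F)));
            id >>>= 6;
        } while (id != 0);
        return Bytes.toBytes(sb.toString());
    }
}

For example, ids 1000 and 1001 come out as "eF" and "fF" - different leading characters, so consecutive inserts no longer hammer a single region.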
Another option - if you don't care whether your rowkeys are plain text - is to convert the IDs to binary numbers and then reverse the bytes; that would be the most compact. If you do that you would go back to not using delimiters and just have fixed offsets for each component.

Once you have a rowkey design you can then go ahead and create your tables pre-split with multiple empty regions. That should perform much better overall for inserts, especially when the DB is new and empty to start. (There is a quick sketch of both at the bottom of this mail.)

How did the load with 4 million records perform?

-chris

On Jul 29, 2011, at 12:36 AM, Steinmaurer Thomas wrote:

> Hi Chris!
>
> Your questions are somewhat hard to answer for me, because I'm not really
> in charge of the test cluster from an administration/setup POV.
>
> Basically, when running:
> http://xxx:60010/master.jsp
>
> I see 7 region servers, each with a "maxHeap" value of 995.
>
> When clicking on the different tables depending on the compression type,
> I get the following information:
>
> GZ compressed table: 3 regions hosted by one region server
> LZO compressed table: 8 regions hosted by two region servers, where the
> start region is hosted by one region server and all other 7 regions are
> hosted on the second region server
>
> Regarding the insert pattern etc., please have a look at my reply to
> Chiku, where I describe the test data generator, the table layout, etc.
> a bit.
>
> Thanks,
> Thomas
>
> -----Original Message-----
> From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of Chris
> Tarnas
> Sent: Thursday, July 28, 2011 19:43
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
>
> During the load did you add enough data to trigger a flush or compaction?
> In our cluster that amount of data inserted would not necessarily be
> enough to actually flush store files. Performance really depends on how
> the table's regions are laid out, the insert pattern, the number of
> regionservers and the amount of RAM allocated to each regionserver. If
> you don't see any flushes or compactions in the log, try repeating the
> test, then flushing the table and doing a compaction (or adding more data
> so it happens automatically), and timing everything. It would be
> interesting to see if the GZ benefit holds up.
>
> -chris
>
> On Jul 28, 2011, at 6:31 AM, Steinmaurer Thomas wrote:
>
>> Hello,
>>
>> we ran a test client generating data into a GZ and an LZO compressed
>> table, with equal data sets (number of rows: 1008000 and the same table
>> schema), ~7.78 GB of disk space uncompressed in HDFS. LZO is ~887 MB
>> whereas GZ is ~444 MB, so basically half of LZO.
>>
>> Execution time of the data-generating client was 1373 seconds into the
>> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
>> generation client is based on HTablePool and uses batch operations.
>>
>> So in our (simple) test, GZ beats LZO in both disk usage and
>> execution time of the client. We haven't tried reads yet.
>>
>> Is this an expected result? I thought LZO is the recommended
>> compression algorithm? Or does LZO outperform GZ with a growing
>> amount of data or in read scenarios?
>>
>> Regards,
>>
>> Thomas
>
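PS: here is the sketch I promised above. It is untested, written against the 0.90-style client API, and the table name, column family and region count are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitDemo {

    // Fixed-width binary key component: an 8-byte id with the bytes
    // reversed, so the fast-moving low byte of a sequential id comes first.
    static byte[] reversedIdBytes(long id) {
        byte[] b = Bytes.toBytes(id);      // big-endian, 8 bytes
        for (int i = 0; i < 4; i++) {      // reverse in place
            byte t = b[i];
            b[i] = b[7 - i];
            b[7 - i] = t;
        }
        return b;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("vehicle_data");
        desc.addFamily(new HColumnDescriptor("d"));

        // Reversed keys make the first byte roughly uniform, so splitting
        // the first byte's range evenly gives evenly loaded regions.
        int numRegions = 16;
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        admin.createTable(desc, splits);
    }
}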