From: Alex Baranau
To: user@hbase.apache.org
Date: Mon, 17 Sep 2012 13:21:11 -0400
Subject: Re: Hbase Scan - number of columns make the query performance way different

Are you using the HBase shell to test performance? In my experience, this
may not be a good idea if you run it from one of the nodes of your cluster.
The shell's speed wasn't very representative.

Other than that:

> I have a hbase table which has a lot of columns in a single column family.
> eg. let's say I have a users table, then userid, username, email .... etc
> etc 15 fields all together are in the single columnFamily.

15 fields is not really "a lot of columns". Selecting several vs. all should
not make a big difference if they are in the same column family, unless some
of them have large values, so that it simply takes longer to transfer those
values over the network (is your network fast, btw?).

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Thu, Sep 13, 2012 at 11:02 AM, Jacques wrote:

> Not sure of your schema...
>
> Each column family is in a separate collection of StoreFiles. Scan all will
> read all these files whereas your second scan will only read the StoreFiles
> associated with column family cf (difference if you have multiple column
> families). Additionally, pushing a large amount of data from region
> servers to wherever you're running the shell will slow things down.
>
> It is difficult to respond to this unless you reveal your entire data
> structure and nature as well as your deployment scenario.
>
> Jacques
>
>
> On Thu, Sep 13, 2012 at 7:35 AM, Shengjie Min wrote:
>
> > In my case, I am not feeding hbase result to mapred, it's just pure hbase
> > scan, returning all columns vs two columns makes huge difference to me.
> >
> > On 13 September 2012 15:29, Doug Meil wrote:
> >
> > >
> > > Hi there, I don't know the specifics of your environment, but ...
> > >
> > > http://hbase.apache.org/book.html#perf.reading
> > > 11.8.2. Scan Attribute Selection
> > >
> > > ... describes paying attention to the number of columns you are
> > > returning, particularly when using HBase as a MR source. In short,
> > > returning only the columns you need means you are reducing the data
> > > transferred between the RS and the client and the number of KV's
> > > evaluated in the RS, etc.
> > >
> > >
> > > On 9/13/12 10:12 AM, "Shengjie Min" wrote:
> > >
> > > >Hi,
> > > >
> > > >I found an interesting difference between hbase scan query.
> > > >
> > > >I have a hbase table which has a lot of columns in a single column
> > > >family.
> > > >eg. let's say I have a users table, then userid, username, email ....
> > > >etc etc 15 fields all together are in the single columnFamily.
> > > >
> > > >if you are familiar with RDBMS,
> > > >
> > > >query 1: select * from users
> > > >vs
> > > >query 2: select userid, username from users
> > > >
> > > >in mysql, these two has a difference, the query 2 will be obviously
> > > >faster, but two queries won't give you a huge difference from
> > > >performance perspective.
> > > >
> > > >In Hbase, I noticed that:
> > > >
> > > >query 3: scan 'users', // this is basically return me all 15 fields
> > > >vs
> > > >query 4: scan 'users', {COLUMNS=>['cf:userid','cf:username']} // this
> > > >is return me only two fields: userid , username
> > > >
> > > >query 3 here takes way longer than query 4, Given a big data set. In my
> > > >test, I have around 1,000,000 user records. You are talking about
> > > >query 3 - 100 secs VS query 4 - a few secs.
> > > >
> > > >
> > > >Can anybody explain to me, why the width of the resultset in HBASE can
> > > >impact the performance that much?
> > > >
> > > >
> > > >Shengjie Min
> > >
> >
> > --
> > All the best,
> > Shengjie Min
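
A rough way to reason about the "data transferred between the RS and the
client" point raised in the thread is a back-of-the-envelope payload
estimate. The sketch below is only an illustration: the row key length,
the value sizes, and the twelve `fieldN` column names are hypothetical and
not taken from the thread, and the 20-byte fixed overhead assumes the
classic KeyValue layout (key length, value length, row length, family
length, timestamp, type). It ignores RPC framing, compression, and block
encoding, so treat the numbers as relative, not absolute.

```python
# Back-of-the-envelope estimate of how much cell data a scan returns.
# Every size below is an illustrative assumption, not a measurement.

# Fixed bytes per cell, assuming the classic KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
KV_FIXED_OVERHEAD = 20


def kv_size(row_key_len, family_len, qualifier_len, value_len):
    """Approximate serialized size in bytes of one cell (KeyValue)."""
    return (KV_FIXED_OVERHEAD + row_key_len + family_len
            + qualifier_len + value_len)


def scan_payload_bytes(num_rows, row_key_len, family_len, columns):
    """Approximate bytes returned by a scan.

    `columns` maps a qualifier name to an assumed average value
    length in bytes. The row key and family are repeated in every cell.
    """
    per_row = sum(
        kv_size(row_key_len, family_len, len(qualifier), value_len)
        for qualifier, value_len in columns.items()
    )
    return num_rows * per_row


# Hypothetical workload loosely modelled on the thread's example.
ROWS = 1_000_000          # "around 1,000,000 user records"
ROW_KEY_LEN = 8           # assumed
FAMILY_LEN = len("cf")    # family name used in the thread's query 4

TWO_COLUMNS = {"userid": 8, "username": 16}            # assumed sizes
ALL_COLUMNS = dict(
    TWO_COLUMNS,
    email=32,
    **{f"field{i}": 64 for i in range(4, 16)},          # 12 made-up fields
)

if __name__ == "__main__":
    two = scan_payload_bytes(ROWS, ROW_KEY_LEN, FAMILY_LEN, TWO_COLUMNS)
    full = scan_payload_bytes(ROWS, ROW_KEY_LEN, FAMILY_LEN, ALL_COLUMNS)
    print(f"two columns: {two / 1e6:.1f} MB")
    print(f"all columns: {full / 1e6:.1f} MB")
    print(f"ratio:       {full / two:.1f}x")
```

Changing one of the assumed value sizes to something large (for example a
multi-kilobyte field) shows how a single wide column can dominate the
transfer, which is the case mentioned above where selecting fewer columns
would matter most.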