Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of alex.baranov.v@gmail.com
 designates 209.85.210.169 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAO83RbUgPbcjGDD3xu98VQE3fHJ2OCjE=cX3i7sDsJERNrKg8w@mail.gmail.com>
References: 
 <CANDVYVkdOm5=B8bnAwm13A-T-DNrSxquxq_LGfMVxt=j3=G2GQ@mail.gmail.com>
	<CC776550.4D14F%doug.meil@explorysmedical.com>
	<CANDVYV=wGC_bePca3Vg4SUg6+M2R5q8ehP+TCQAbKdOc9mtYHA@mail.gmail.com>
	<CAO83RbUgPbcjGDD3xu98VQE3fHJ2OCjE=cX3i7sDsJERNrKg8w@mail.gmail.com>
Date: Mon, 17 Sep 2012 13:21:11 -0400
Message-ID: 
 <CAA7+SiCw97-x7HKwtvONX6svJvaZmG9hK5yO-EUp_WYMsH-e7A@mail.gmail.com>
Subject: Re: Hbase Scan - number of columns make the query performance way
 different
From: Alex Baranau <alex.baranov.v@gmail.com>
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=14dae93405d508079804c9e900f2

--14dae93405d508079804c9e900f2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Are you using the HBase Shell to test performance? In my experience, this may
not be a good idea if you run it from one of the nodes of your cluster.
The shell speed wasn't very representative.

Other than that:

> I have a hbase table which has a lot of columns in a single column family.
> eg. let's say I have a users table, then userid, username, email .... etc
> etc 15 fields all together are in the single columnFamily.

15 fields is not really "a lot of columns". Selecting a few columns versus
all of them should not make a big difference if they are in the same column
family, unless some of them have large values, so that it simply takes longer
to transfer those values over the network (is your network fast, btw?).
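
To illustrate the point about value sizes, here is a rough back-of-the-envelope
sketch in Python. Everything in it is an assumption made for illustration: the
row key length, the field names beyond userid/username/email, the value sizes,
the hypothetical "profile_blob" column, and the per-cell overhead of about 20
bytes (an approximation of the length fields, timestamp and type stored with
each cell). Only the 1,000,000 row count comes from the original question.

```python
# Rough estimate of the bytes a full-table scan ships to the client.
# All sizes are illustrative assumptions, not measurements.

KV_FIXED_OVERHEAD = 20  # assumed approx. per-cell bytes (lengths, timestamp, type)


def cell_size(row_key_len, family_len, qualifier_len, value_len):
    """Approximate serialized size of one cell in bytes."""
    return KV_FIXED_OVERHEAD + row_key_len + family_len + qualifier_len + value_len


def scan_bytes(num_rows, row_key_len, family, columns):
    """Total bytes for num_rows rows; columns maps qualifier -> value length."""
    per_row = sum(
        cell_size(row_key_len, len(family), len(qualifier), value_len)
        for qualifier, value_len in columns.items()
    )
    return num_rows * per_row


# Hypothetical 'users' schema: 14 small fields plus one large one.
small = {name: 20 for name in ("userid", "username", "email")}
small.update({"field%d" % i: 20 for i in range(11)})  # 11 more small fields
all_columns = dict(small, profile_blob=10000)         # one assumed ~10 KB value
two_columns = {"userid": 20, "username": 20}

rows = 1000000  # row count mentioned in the original question
full = scan_bytes(rows, 16, "cf", all_columns)
small_only = scan_bytes(rows, 16, "cf", small)
narrow = scan_bytes(rows, 16, "cf", two_columns)

print("15 columns, one large value: %.0f MB" % (full / 1e6))
print("14 small columns only:       %.0f MB" % (small_only / 1e6))
print("2 columns:                   %.0f MB" % (narrow / 1e6))
```

With small values only, the gap between the wide and the narrow scan stays
within roughly one order of magnitude. A single large column changes the
picture much more, which is why the actual value sizes matter here.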

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

On Thu, Sep 13, 2012 at 11:02 AM, Jacques <whshub@gmail.com> wrote:

> Not sure of your schema...
>
> Each column family is in a separate collection of StoreFiles. Scan all will
> read all these files whereas your second scan will only read the StoreFiles
> associated with column family cf (difference if you have multiple column
> families).  Additionally, pushing a large amount of data from region
> servers to wherever you're running the shell will slow things down.
>
> It is difficult to respond to this unless you reveal your entire data
> structure and nature as well as your deployment scenario.
>
> Jacques
>
>
>
> On Thu, Sep 13, 2012 at 7:35 AM, Shengjie Min <kelvin.msj@gmail.com>
> wrote:
>
> > In my case, I am not feeding hbase result to mapred, it's just pure hbase
> > scan, returning all columns vs two columns makes huge difference to me.
> >
> > On 13 September 2012 15:29, Doug Meil <doug.meil@explorysmedical.com>
> > wrote:
> >
> > >
> > > Hi there, I don't know the specifics of your environment, but ...
> > >
> > > http://hbase.apache.org/book.html#perf.reading
> > > 11.8.2. Scan Attribute Selection
> > >
> > >
> > > ... describes paying attention to the number of columns you are returning,
> > > particularly when using HBase as a MR source.  In short, returning only
> > > the columns you need means you are reducing the data transferred between
> > > the RS and the client and the number of KV's evaluated in the RS, etc.
> > >
> > >
> > >
> > >
> > > On 9/13/12 10:12 AM, "Shengjie Min" <kelvin.msj@gmail.com> wrote:
> > >
> > > >Hi,
> > > >
> > > >I found an interesting difference between hbase scan query.
> > > >
> > > > >I have a hbase table which has a lot of columns in a single column family.
> > > > >eg. let's say I have a users table, then userid, username, email .... etc
> > > > >etc 15 fields all together are in the single columnFamily.
> > > >
> > > >if you are familiar with RDBMS,
> > > >
> > > >query 1: select * from users
> > > >vs
> > > >query 2: select userid, username from users
> > > >
> > > > >in mysql, these two has a difference, the query 2 will be obviously faster,
> > > >but two queries won't give you a huge difference from performance
> > > >perspective.
> > > >
> > > >In Hbase, I noticed that:
> > > >
> > > > >query 3: scan 'users',   // this is basically return me all 15 fields
> > > > >vs
> > > > >query 4: scan 'users', {COLUMNS=>['cf:userid','cf:username']}    // this is
> > > > >return me only two fields: userid , username
> > > >
> > > > >query 3 here takes way longer than query 4, Given a big data set. In my
> > > > >test, I have around 1,000,000 user records. You are talking about query 3 -
> > > > >100 secs VS query 4 - a few secs.
> > > >
> > > >
> > > > >Can anybody explain to me, why the width of the resultset in HBASE can
> > > >impact the performance that much?
> > > >
> > > >
> > > >Shengjie Min
> > >
> > >
> > >
> >
> >
> > --
> > All the best,
> > Shengjie Min
> >
>

--14dae93405d508079804c9e900f2--