Subject: RE: Scalability problem with HBase
From: "Geoff Hendrey" <ghendrey@decarta.com>
To: user@hbase.apache.org
Date: Sun, 23 Jan 2011 17:40:49 -0800

just curious what you mean by "reverse search index".

-g

-----Original Message-----
From: Thibault Dory [mailto:dory.thibault@gmail.com]
Sent: Sunday, January 23, 2011 1:42 PM
To: user@hbase.apache.org
Subject: Scalability problem with HBase

Hello,

I'm currently testing the performance of HBase for a specific test case. I have downloaded ~20000 articles from Wikipedia and I want to test read/write and MapReduce performance. I'm using HBase 0.20.6 and Hadoop 0.20.2 on a cluster of Ubuntu servers connected with Gigabit Ethernet.

My test works like this:
- I start with 3 physical servers, used like this: 3 Hadoop nodes (1 namenode and 3 datanodes) and, for HBase, 1 master and 3 regionservers.
- I insert all the articles, one article per row, each row containing two cells: ID and article.
- I start 3 threads from another machine that read and update articles at random (an update simply appends the string "1" to the end of the article), and I measure the time needed for all the operations to finish (see the client-loop sketch below).
- I build a reverse search index using two phases of MapReduce and measure the time to compute it (see the mapper sketch below).
- Then I add a new server, on which I start a datanode and a regionserver, and I run the benchmark again with 4 threads.
- I repeat those steps until I reach the last available server (8 in total).

I keep the total number of operations constant, and appending "1" to an article does not change its size much.
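A minimal sketch of what each client thread does, using the HBase 0.20 client API (the table name "articles" and the column "content:article" are placeholders, not necessarily the exact schema):

import java.util.Random;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadUpdateThread implements Runnable {
  // Placeholder column layout: one family "content" with one qualifier "article".
  private static final byte[] FAMILY = Bytes.toBytes("content");
  private static final byte[] QUALIFIER = Bytes.toBytes("article");

  private final int operations;    // operations this thread performs
  private final int articleCount;  // number of rows to pick from
  private final Random random = new Random();

  public ReadUpdateThread(int operations, int articleCount) {
    this.operations = operations;
    this.articleCount = articleCount;
  }

  public void run() {
    try {
      HTable table = new HTable(new HBaseConfiguration(), "articles");
      for (int i = 0; i < operations; i++) {
        byte[] row = Bytes.toBytes(Integer.toString(random.nextInt(articleCount)));
        // Read the article.
        Result result = table.get(new Get(row));
        byte[] article = result.getValue(FAMILY, QUALIFIER);
        // About 20% of the operations also write the row back with "1" appended.
        if (random.nextDouble() < 0.2) {
          Put put = new Put(row);
          put.add(FAMILY, QUALIFIER, Bytes.add(article, Bytes.toBytes("1")));
          table.put(put);
        }
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}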
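Roughly, the first MapReduce phase of the index job emits (word, article row key) pairs from each article, and the second phase aggregates the row keys per word. A simplified sketch of such a phase-one mapper with the HBase 0.20 TableMapper API (again, column names are placeholders):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Emits (word, article row key) pairs; a reducer and the second MapReduce
// phase would group the postings per word.
public class InvertedIndexMapper extends TableMapper<Text, Text> {
  private static final byte[] FAMILY = Bytes.toBytes("content");
  private static final byte[] QUALIFIER = Bytes.toBytes("article");

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
      throws IOException, InterruptedException {
    byte[] raw = columns.getValue(FAMILY, QUALIFIER);
    if (raw == null) {
      return; // skip rows without an article cell
    }
    String article = Bytes.toString(raw);
    Text docId = new Text(rowKey.get());
    for (String word : article.toLowerCase().split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), docId);
      }
    }
  }
}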
The problem is the kind of results I'm seeing. I expected the time needed to perform the read/write operations to decrease as I add new servers to the cluster, but I'm experiencing exactly the opposite. Moreover, the more requests I make, the slower the cluster becomes, for a constant data size.

For example, here are the results in seconds on my cluster just after the first insertion, with 3 nodes, for 10000 operations (20% of which update the articles):

Individual times : [30.338116567, 24.402751402, 25.650858953, 27.699796324, 26.589869283, 33.909433157, 52.538378122, 48.0114018, 47.149348721, 42.825791078]

Then, one minute after that run ends, everything else staying the same:

Individual times : [58.181552147, 48.931155328, 62.509309199, 57.198395723, 63.267397201, 54.160937835, 57.635454167, 64.780292628, 62.762390414, 61.381563914]

And finally, five minutes after the last run ends, everything else staying the same:

Individual times : [56.852388792, 58.011768345, 63.578745601, 68.008043323, 79.545419247, 87.183962628, 88.1989561, 94.532923849, 99.852569437, 102.355709259]

It seems quite clear that the time needed to perform the same amount of operations is rising fast. When I add servers to the cluster, the time needed to perform the operations keeps rising. Here are the results for 4 servers, using the same methodology as above:

Immediately after the new server is added:
Individual times : [86.224951713, 80.777746425, 84.814954717, 93.07842057, 83.348558502, 90.037499401, 106.799544002, 98.122952552, 97.057614119, 94.277285461]

One minute after the last test:
Individual times : [94.633454698, 101.250176482, 99.945406887, 101.754011832, 106.882328108, 97.808320021, 97.050036703, 95.844557847, 97.931572694, 92.258327247]

Five minutes after the last test:
Individual times : [98.188162512, 96.332809905, 93.598184149, 93.552745204, 96.905860067, 102.149408296, 101.545412423, 105.377292242, 108.855117219, 110.429000567]

The times needed to compute the inverse search index using MapReduce are rising too:

3 nodes
Results : [106.604148815, 104.829340323, 101.986450167, 102.871575842, 102.177574017]
4 nodes
Results : [120.451610507, 115.007344179, 115.075212636, 115.146883431, 114.216465299]
5 nodes
Results : [139.563445944, 132.933993434, 134.117730658, 132.927127084, 132.041046308]

I don't think this behaviour is normal; I should see the time needed to complete the same amount of work decrease as I add more servers to the cluster. Unless this is because my cluster is too small? I should say that all the servers in the cluster seem to use an equal amount of CPU while the test is running, so it looks like all of them are working and there is no server that is not storing data.

What do you think? Where did I screw up to see that kind of results with HBase?