Subject: Re: Inconsistent scan performance
From: "james.johansville"
To: user@hbase.apache.org, larsh@apache.org
Date: Tue, 29 Mar 2016 07:50:26 -0700

Hello all,

Thank you for the help. It was the network latency between datacenters that caused it; once I started forcing each scan to run against the regionserver that hosts its range, the performance issues disappeared. I am surprised I did not see nearly as much variance in writes, even though those crossed datacenters as well.
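As a rough illustration (not the actual utility from this thread), a region-aligned partitioning can be derived from RegionLocator along the lines below. The table name is a placeholder, and the scans run sequentially here only to keep the sketch short; in practice each range would be handed to a worker running in the same datacenter as the hosting regionserver.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class RegionAlignedScan {
    public static void main(String[] args) throws IOException {
        TableName tableName = TableName.valueOf("mytable"); // placeholder table name
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator = conn.getRegionLocator(tableName);
             Table table = conn.getTable(tableName)) {

            // One scan per region, bounded by that region's start/end keys, so the
            // work can be scheduled next to the regionserver that hosts the data
            // instead of pulling rows across the WAN.
            List<HRegionLocation> regions = locator.getAllRegionLocations();
            for (HRegionLocation loc : regions) {
                byte[] start = loc.getRegionInfo().getStartKey(); // empty = start of table
                byte[] stop = loc.getRegionInfo().getEndKey();    // empty = end of table
                String host = loc.getHostname();                  // where to place the worker

                Scan scan = new Scan();
                scan.setStartRow(start);
                scan.setStopRow(stop);
                scan.setCaching(5000);
                scan.setCacheBlocks(false);

                long rows = 0;
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        rows++;
                    }
                }
                System.out.println("host=" + host + " rows=" + rows);
            }
        }
    }
}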
On Sat, Mar 26, 2016 at 11:00 PM, Lars wrote:

> Hi James,
>
> The low numbers are _way_ slow -- way lower than what we typically see. I assume the region servers have spinning drives? How many? Can you report metrics like IOWAIT? Also, as Nicolas mentioned, can you report the data locality index on each region server?
>
> What's the networking between the machines? I assume 1GbE? And all NICs are good (none reports a lower link speed)?
>
> Is it always the same servers that are slow, or does it vary?
>
> Lastly, how big are the rows?
>
> Thanks.
>
> -- Lars
>
> From: James Johansville
> To: user@hbase.apache.org
> Sent: Friday, March 25, 2016 12:23 PM
> Subject: Re: Inconsistent scan performance
>
> Hello all,
>
> I have 13 RegionServers and presplit the table into 13 regions (which motivated my comment that I aligned my queries with the regionservers, which obviously isn't accurate). I have been testing using a multiple of 13 for the partitioned scans.
>
> Here is my current region setup -- I converted the row identifier to a logical identifier known to my code for easier reading. There are 26 regions split roughly evenly over a logical range of 0-996.
>
> Here are the region boundaries. The splits aren't perfect, so I assume that for a 26-partition scan each scan is going to read most of one region plus a small piece of another (maybe even a third). I have given each server an index and marked its datacenter with an index as well -- so the first three regions are on server #9, and so on.
>
> 0  region range=[EMPTY]~id:38  server index=9  dc=1
> 1  region range=id:38~id:77    server index=9  dc=1
> 2  region range=id:77~id:115   server index=9  dc=1
> 3  region range=id:115~id:154  server index=10 dc=3
> 4  region range=id:154~id:192  server index=3  dc=2
> 5  region range=id:192~id:231  server index=10 dc=3
> 6  region range=id:231~id:269  server index=10 dc=3
> 7  region range=id:269~id:308  server index=10 dc=3
> 8  region range=id:308~id:346  server index=1  dc=1
> 9  region range=id:346~id:385  server index=4  dc=2
> 10 region range=id:385~id:423  server index=2  dc=2
> 11 region range=id:423~id:462  server index=5  dc=3
> 12 region range=id:462~id:500  server index=11 dc=1
> 13 region range=id:500~id:539  server index=10 dc=3
> 14 region range=id:539~id:577  server index=11 dc=1
> 15 region range=id:577~id:616  server index=11 dc=1
> 16 region range=id:616~id:654  server index=6  dc=2
> 17 region range=id:654~id:693  server index=12 dc=3
> 18 region range=id:693~id:730  server index=8  dc=1
> 19 region range=id:730~id:769  server index=4  dc=2
> 20 region range=id:769~id:806  server index=12 dc=3
> 21 region range=id:806~id:845  server index=11 dc=1
> 22 region range=id:845~id:883  server index=9  dc=1
> 23 region range=id:883~id:921  server index=12 dc=3
> 24 region range=id:921~id:958  server index=0  dc=3
> 25 region range=id:958~[EMPTY] server index=12 dc=3
>
> Here are my scan ranges -- the end isn't inclusive. Each scan range corresponds to a particular partition number, in sequence.
>
> 0  scan range=id:0~id:39
> 1  scan range=id:39~id:78
> 2  scan range=id:78~id:117
> 3  scan range=id:117~id:156
> 4  scan range=id:156~id:195
> 5  scan range=id:195~id:234
> 6  scan range=id:234~id:273
> 7  scan range=id:273~id:312
> 8  scan range=id:312~id:351
> 9  scan range=id:351~id:389
> 10 scan range=id:389~id:427
> 11 scan range=id:427~id:465
> 12 scan range=id:465~id:503
> 13 scan range=id:503~id:541
> 14 scan range=id:541~id:579
> 15 scan range=id:579~id:617
> 16 scan range=id:617~id:655
> 17 scan range=id:655~id:693
> 18 scan range=id:693~id:731
> 19 scan range=id:731~id:769
> 20 scan range=id:769~id:807
> 21 scan range=id:807~id:845
> 22 scan range=id:845~id:883
> 23 scan range=id:883~id:921
> 24 scan range=id:921~id:959
> 25 scan range=id:959~id:997
>
> I ran the utility twice on the 26-partition split.
>
> First run:
>
> total # partitions:26; partition_number:0; rows:1775368 elapsed_sec:433.525 ops/sec:4095.191742114065
> total # partitions:26; partition_number:1; rows:1774447 elapsed_sec:427.884 ops/sec:4147.028166512419
> total # partitions:26; partition_number:2; rows:1774755 elapsed_sec:674.357 ops/sec:2631.773674774637
> total # partitions:26; partition_number:3; rows:1776706 elapsed_sec:376.375 ops/sec:4720.573895715709
> total # partitions:26; partition_number:4; rows:1777409 elapsed_sec:47.484 ops/sec:37431.74543003959
> total # partitions:26; partition_number:5; rows:1779412 elapsed_sec:39.443 ops/sec:45113.50556499252
> total # partitions:26; partition_number:6; rows:1773312 elapsed_sec:407.076 ops/sec:4356.2184948265185
> total # partitions:26; partition_number:7; rows:1773790 elapsed_sec:436.169 ops/sec:4066.749356327479
> total # partitions:26; partition_number:8; rows:1773311 elapsed_sec:617.88 ops/sec:2869.99255518871
> total # partitions:26; partition_number:9; rows:1727384 elapsed_sec:27.348 ops/sec:63163.0832236361
> total # partitions:26; partition_number:10; rows:1734960 elapsed_sec:64.383 ops/sec:26947.48613764503
> total # partitions:26; partition_number:11; rows:1729728 elapsed_sec:381.221 ops/sec:4537.336610522505
> total # partitions:26; partition_number:12; rows:1732999 elapsed_sec:59.049 ops/sec:29348.49023692188
> total # partitions:26; partition_number:13; rows:1731442 elapsed_sec:369.975 ops/sec:4679.889181701466
> total # partitions:26; partition_number:14; rows:1732026 elapsed_sec:698.667 ops/sec:2479.0436645784043
> total # partitions:26; partition_number:15; rows:1727737 elapsed_sec:398.964 ops/sec:4330.558646895459
> total # partitions:26; partition_number:16; rows:1728357 elapsed_sec:695.636 ops/sec:2484.570953774675
> total # partitions:26; partition_number:17; rows:1733250 elapsed_sec:392.337 ops/sec:4417.758202769558
> total # partitions:26; partition_number:18; rows:1742667 elapsed_sec:410.238 ops/sec:4247.94143887207
> total # partitions:26; partition_number:19; rows:1742777 elapsed_sec:374.509 ops/sec:4653.498313792192
> total # partitions:26; partition_number:20; rows:1742520 elapsed_sec:44.447 ops/sec:39204.445744369696
> total # partitions:26; partition_number:21; rows:1742248 elapsed_sec:404.692 ops/sec:4305.120931473812
> total # partitions:26; partition_number:22; rows:1743288 elapsed_sec:420.886 ops/sec:4141.948175990648
> total # partitions:26; partition_number:23; rows:1741106 elapsed_sec:374.782 ops/sec:4645.650004535971
> total # partitions:26; partition_number:24; rows:1738634 elapsed_sec:34.059 ops/sec:51047.71132446637
> total # partitions:26; partition_number:25; rows:1736522 elapsed_sec:38.873 ops/sec:44671.67442698017
>
> Second run:
>
> total # partitions:26; partition_number:0; rows:1775368 elapsed_sec:687.482 ops/sec:2582.4210670243006
> total # partitions:26; partition_number:1; rows:1774447 elapsed_sec:687.132 ops/sec:2582.3961043875124
> total # partitions:26; partition_number:2; rows:1774755 elapsed_sec:57.592 ops/sec:30815.998749826365
> total # partitions:26; partition_number:3; rows:1776706 elapsed_sec:369.096 ops/sec:4813.6690725448125
> total # partitions:26; partition_number:4; rows:1777409 elapsed_sec:49.486 ops/sec:35917.41098492503
> total # partitions:26; partition_number:5; rows:1779412 elapsed_sec:39.008 ops/sec:45616.59146841673
> total # partitions:26; partition_number:6; rows:1773312 elapsed_sec:405.897 ops/sec:4368.871905926873
> total # partitions:26; partition_number:7; rows:1773790 elapsed_sec:368.597 ops/sec:4812.274652262498
> total # partitions:26; partition_number:8; rows:1773311 elapsed_sec:401.001 ops/sec:4422.210917179758
> total # partitions:26; partition_number:9; rows:1727384 elapsed_sec:381.702 ops/sec:4525.477990683832
> total # partitions:26; partition_number:10; rows:1734960 elapsed_sec:659.132 ops/sec:2632.1890000788917
> total # partitions:26; partition_number:11; rows:1729728 elapsed_sec:65.535 ops/sec:26393.95742732891
> total # partitions:26; partition_number:12; rows:1732999 elapsed_sec:367.694 ops/sec:4713.15550430521
> total # partitions:26; partition_number:13; rows:1731442 elapsed_sec:394.045 ops/sec:4394.020987450672
> total # partitions:26; partition_number:14; rows:1732026 elapsed_sec:692.832 ops/sec:2499.9220590272967
> total # partitions:26; partition_number:15; rows:1727737 elapsed_sec:637.833 ops/sec:2708.760757126082
> total # partitions:26; partition_number:16; rows:1728357 elapsed_sec:35.41 ops/sec:48809.85597288902
> total # partitions:26; partition_number:17; rows:1733250 elapsed_sec:348.711 ops/sec:4970.448308197906
> total # partitions:26; partition_number:18; rows:1742667 elapsed_sec:747.852 ops/sec:2330.2297780844337
> total # partitions:26; partition_number:19; rows:1742777 elapsed_sec:21.719 ops/sec:80242.04613472075
> total # partitions:26; partition_number:20; rows:1742520 elapsed_sec:41.713 ops/sec:41774.02728166279
> total # partitions:26; partition_number:21; rows:1742248 elapsed_sec:398.595 ops/sec:4370.97304281288
> total # partitions:26; partition_number:22; rows:1743288 elapsed_sec:416.296 ops/sec:4187.61650364164
> total # partitions:26; partition_number:23; rows:1741106 elapsed_sec:376.67 ops/sec:4622.364403854833
> total # partitions:26; partition_number:24; rows:1738634 elapsed_sec:34.025 ops/sec:51098.72152828803
> total # partitions:26; partition_number:25; rows:1736522 elapsed_sec:38.673 ops/sec:44902.69697204768
>
> I can't make rhyme or reason of it. There seems to be an imperfect correspondence between my scan partitions and the regions, but some queries that could in theory cross multiple datacenters appear fast, while some scans restricted to a single regionserver are fast and others are slow. Does anyone with more experience have insight?
>
> On Fri, Mar 25, 2016 at 9:20 AM, Stack wrote:
>
> > On Fri, Mar 25, 2016 at 3:50 AM, Ted Yu wrote:
> >
> > > James:
> > > Another experiment you can do is to enable region replica - HBASE-10070.
> > >
> > > This would bring down the read variance greatly.
> >
> > Suggest you NOT do this, James.
> >
> > Let's figure your issue as-is rather than compound it by adding yet more moving parts.
> >
> > St.Ack
> >
> > > On Mar 25, 2016, at 2:41 AM, Nicolas Liochon wrote:
> > >
> > > > The read path is much more complex than the write one, so the response time has much more variance. The gap is so wide here that I would bet on Ted's or Stack's points, but here are a few other sources of variance:
> > > > - HBase cache: as Anoop said, maybe the data is already in the HBase cache (setCacheBlocks(false) means "don't add blocks to the cache", not "don't use the cache").
> > > > - OS cache: if the data is not in the HBase cache, it may be in the operating system cache (for example if you run the test multiple times).
> > > > - Data locality: if you're lucky, the data is local to the region server. If you're not, the reads need an extra network hop.
> > > > - Number of stores: more hfiles/stores per region => slower reads.
> > > > - Number of versions and so on (a sub-case of the previous one): if the rows have been updated multiple times and the compaction has not run yet, you will read much more data.
> > > > - (Another sub-case): the data has not been flushed yet and is still available in the memstore => fast read.
> > > >
> > > > None of these points has any importance for the write path. Basically, the write variance says nothing about the variance you will get on the reads.
> > > >
> > > > IIRC, locality and the number of stores are visible in the HBase UI. Doing a table flush and then running a major compaction generally helps to stabilize response times when you do a test. But it should not explain the x25 you're seeing; there is something else somewhere. I don't get the regionserver boundaries you're mentioning: there is no boundary between regionservers. A regionserver can host A->D and M->S while another hosts D->M and S->Z, for example.
> > > >
> > > > On Fri, Mar 25, 2016 at 6:51 AM, Anoop John wrote:
> > > >
> > > > > I see you set cacheBlocks to false on the Scan. By any chance, on some other RS(s), is the data you are looking for already in cache (from a previous scan, or from cache-on-write)? And there are no concurrent writes anyway, right? This much difference in time! One possibility is blocks being available or not available in the cache.
> > > > >
> > > > > -Anoop-
> > > > >
> > > > > On Fri, Mar 25, 2016 at 11:04 AM, Stack wrote:
> > > > >
> > > > > > On Thu, Mar 24, 2016 at 4:45 PM, James Johansville <james.johansville@gmail.com> wrote:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > So, I wrote a Java application for HBase that does a partitioned full-table scan according to a set number of partitions. For example, if 20 partitions are specified, then 20 separate full scans are launched, each covering an equal slice of the row identifier range.
> > > > > > >
> > > > > > > The rows are uniformly distributed throughout the RegionServers.
> > > > > >
> > > > > > How many RegionServers? How many Regions? Are Regions evenly distributed across the servers? If you put all partitions on one machine and then run your client, do the timings even out?
> > > > > >
> > > > > > The disparity seems really wide.
> > > > > >
> > > > > > St.Ack
> > > > > >
> > > > > > > I confirmed this through the hbase shell. I have only one column family, and each row has the same number of column qualifiers.
> > > > > > >
> > > > > > > My problem is that the individual scan performance is wildly inconsistent even though the scans fetch approximately similar numbers of rows. The inconsistency appears to be random with respect to hosts, regionservers, partitions, and CPU cores. I am the only user of the fleet and am not running any other concurrent HBase operation.
> > > > > > >
> > > > > > > I started measuring at the beginning of each scan and stopped measuring after the scan completed. I am not doing any logic with the results, just scanning them.
> > > > > > > For ~230K rows fetched per scan, I am getting anywhere from 4 seconds to 100+ seconds. This seems a little too bouncy to me. Does anyone have any insight? By comparison, a similar utility I wrote to upsert to the regionservers was very consistent in ops/sec, and I had no issues with it.
> > > > > > >
> > > > > > > Using 13 partitions on a machine that has 32 CPU cores and a 16 GB heap, I see anywhere between 3K ops/sec and 82K ops/sec. Here's an example of log output I saved that used 130 partitions:
> > > > > > >
> > > > > > > total # partitions:130; partition id:47; rows:232730 elapsed_sec:6.401 ops/sec:36358.38150289017
> > > > > > > total # partitions:130; partition id:100; rows:206890 elapsed_sec:6.636 ops/sec:31176.91380349608
> > > > > > > total # partitions:130; partition id:63; rows:233437 elapsed_sec:7.586 ops/sec:30772.08014764039
> > > > > > > total # partitions:130; partition id:9; rows:232585 elapsed_sec:32.985 ops/sec:7051.235410034865
> > > > > > > total # partitions:130; partition id:19; rows:234192 elapsed_sec:38.733 ops/sec:6046.3170939508955
> > > > > > > total # partitions:130; partition id:1; rows:232860 elapsed_sec:48.479 ops/sec:4803.316900101075
> > > > > > > total # partitions:130; partition id:125; rows:205334 elapsed_sec:41.911 ops/sec:4899.286583474505
> > > > > > > total # partitions:130; partition id:123; rows:206622 elapsed_sec:42.281 ops/sec:4886.875901705258
> > > > > > > total # partitions:130; partition id:54; rows:232811 elapsed_sec:49.083 ops/sec:4743.210480206996
> > > > > > >
> > > > > > > I use setCacheBlocks(false) and setCaching(5000). Does anyone have any insight into how I can make the read performance more consistent?
> > > > > > >
> > > > > > > Thanks!
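As a rough sketch of the checks suggested earlier in the thread (the per-region data locality index Lars asked about, and the flush plus major compaction Nicolas recommended before a timing run), the Admin API can be used along the lines below. This is an illustration, not code from the thread: the table name is a placeholder, major compaction is triggered asynchronously, and the locality call assumes an HBase 1.x client where RegionLoad exposes getDataLocality().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class LocalityCheck {
    public static void main(String[] args) throws Exception {
        TableName table = TableName.valueOf("mytable"); // placeholder table name
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Flush memstores and major-compact so each region is served from a
            // small number of HFiles before timing a scan run.
            admin.flush(table);
            admin.majorCompact(table); // async: give it time to finish before testing

            // Print the HDFS block locality index for every region on every
            // regionserver; low values mean reads go over the network.
            ClusterStatus status = admin.getClusterStatus();
            for (ServerName sn : status.getServers()) {
                for (RegionLoad rl : status.getLoad(sn).getRegionsLoad().values()) {
                    System.out.println(sn.getHostname() + " " + rl.getNameAsString()
                        + " locality=" + rl.getDataLocality());
                }
            }
        }
    }
}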