Subject: Re: Scan vs Parallel scan.
From: Guillermo Ortiz <konstt2000@gmail.com>
To: user@hbase.apache.org, lars hofhansl
Date: Sun, 14 Sep 2014 11:24:37 +0200

I don't have the code here, but I'll post it in a couple of days. I have to
check the ExecutorService again; I don't remember exactly how I did it.
I'm using HBase 0.98.

On Sunday, September 14, 2014, lars hofhansl wrote:

> What specific version of 0.94 are you using?
>
> In general, if you have multiple spindles (disks) and/or multiple CPU
> cores at the region server, you should benefit from keeping multiple
> region server handler threads busy. I have experimented with this before
> and saw a close to linear speedup (up to the point where all disks/cores
> were busy). Obviously this also assumes this is the only load you throw
> at the servers at this point.
>
> Can you post your complete code to pastebin? Maybe even with some code
> to seed the data?
> How do you run your callables? Did you configure the ExecutorService
> correctly (assuming you use one to run your callables)?
>
> Then we can run it and have a look.
>
> Thanks.
>
> -- Lars
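Lars's ExecutorService question is worth pinning down. A minimal sketch of
how the per-region callables could be driven, assuming the
Callable<List<Result>> shape of the RegionScanner class quoted later in the
thread; the method name runParallelScan and the one-thread-per-region pool
size are illustrative, not the poster's actual code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.hbase.client.Result;

    // Run one partial scan per region concurrently and join the results.
    List<Result> runParallelScan(List<RegionScanner> regionScanners) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(regionScanners.size());
        try {
            // invokeAll submits every callable and waits for all of them.
            List<Future<List<Result>>> futures = pool.invokeAll(regionScanners);
            List<Result> all = new ArrayList<Result>();
            for (Future<List<Result>> future : futures) {
                all.addAll(future.get()); // rethrows any failure from that region's scan
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }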
>
> ----- Original Message -----
> From: Guillermo Ortiz
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Cc:
> Sent: Saturday, September 13, 2014 4:49 PM
> Subject: Re: Scan vs Parallel scan.
>
> What am I missing??
>
> 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz:
>
> > For a partial scan, I guess that I call the RS to get the data; it
> > starts looking in the store files and collecting the data. (It doesn't
> > write to the block cache in either case.) It has the data ready and
> > gives it to the client step by step; I mean, it depends on the caching
> > and batching parameters.
> >
> > Big differences that I see...
> > I'm opening more connections to the table, one per region.
> >
> > I should check the single table scan; it looks like it does partial
> > scans sequentially, since you can see on the HBase Master how the
> > requests increase one after another, not all at the same time.
> >
> > 2014-09-12 15:23 GMT+02:00 Michael Segel:
> >
> >> It doesn't matter which RS, but that you have 1 thread for each
> >> region.
> >>
> >> So for each thread, what's happening?
> >> Step by step, what is the code doing?
> >>
> >> Now you're comparing this against a single table scan, right?
> >> What's happening in the table scan…?
> >>
> >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz wrote:
> >>
> >> > Right. My table, for example, has keys between 0-9, in three
> >> > regions: 0-2, 3-7, 7-9.
> >> > I launch three partial scans in parallel. The scans that I'm
> >> > executing are: scan(0,2), scan(3,7), scan(7,9).
> >> > Each region is in a different RS, so each thread goes to a
> >> > different RS. It's not exactly like that, but in the benchmark
> >> > case that's how it's working.
> >> >
> >> > Really the code will execute a thread for each region, not for
> >> > each RegionServer. But in the test I only have two regions per
> >> > RegionServer. I don't think that's an important point; there are
> >> > two threads per RS.
> >> >
> >> > 2014-09-12 14:48 GMT+02:00 Michael Segel:
> >> >
> >> >> Ok, let's again take a step back…
> >> >>
> >> >> So you are comparing your partial scan(s) against a full table
> >> >> scan?
> >> >>
> >> >> If I understood your question, you launch 3 partial scans where
> >> >> you set the start row and the end row of each scan, right?
> >> >>
> >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz wrote:
> >> >>
> >> >>> Okay, then the partial scan doesn't work as I think.
> >> >>> How could it exceed the limit of a single region if I calculate
> >> >>> the limits?
> >> >>>
> >> >>> The only bad point that I see is that if a region server has
> >> >>> three regions of the same table, I'm executing three partial
> >> >>> scans against this RS and they could compete for resources
> >> >>> (network, etc.) on this node. It'd be better to have one thread
> >> >>> per RS. But that doesn't answer your questions.
> >> >>>
> >> >>> I keep thinking...
> >> >>>
> >> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel:
> >> >>>
> >> >>>> Hi,
> >> >>>>
> >> >>>> I wanted to take a step back from the actual code and to stop
> >> >>>> and think about what you are doing and what HBase is doing
> >> >>>> under the covers.
> >> >>>>
> >> >>>> So in your code, you are asking HBase to do 3 separate scans
> >> >>>> and then you take the result set back and join it.
> >> >>>>
> >> >>>> What does HBase do when it does a range scan?
> >> >>>> What happens when that range scan exceeds a single region?
> >> >>>>
> >> >>>> If you answer those questions… you'll have your answer.
> >> >>>>
> >> >>>> HTH
> >> >>>>
> >> >>>> -Mike
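For readers of the archive, the answer Michael seems to be steering toward:
a client-side range scan is sequential. The ResultScanner starts on the
region holding the start row, and when that region is exhausted the client
library transparently opens a scanner on the next region in key order; it
never fans out on its own. A sketch of that single-scan baseline, assuming
an HTableInterface named table like the one in the code quoted below:

    // One thread, one ResultScanner; the HBase client walks the regions
    // one after another in row-key order, re-opening the scanner at each
    // region boundary.
    Scan scan = new Scan();      // no start/stop row: the whole table
    scan.setCaching(100);        // rows fetched per RPC
    scan.setCacheBlocks(false);
    ResultScanner scanner = table.getScanner(scan);
    try {
        for (Result result : scanner) {
            // consume result
        }
    } finally {
        scanner.close();
    }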
> >> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz wrote:
> >> >>>>
> >> >>>>> It's not all the code; I set things like these as well:
> >> >>>>> scan.setMaxVersions();
> >> >>>>> scan.setCacheBlocks(false);
> >> >>>>> ...
> >> >>>>>
> >> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz:
> >> >>>>>
> >> >>>>>> Yes, that's it. I have changed the HBase version to 0.98.
> >> >>>>>>
> >> >>>>>> I got the start and stop keys with this method:
> >> >>>>>>
> >> >>>>>> private List<RegionScanner> generatePartitions() {
> >> >>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> >> >>>>>>     byte[] startKey;
> >> >>>>>>     byte[] stopKey;
> >> >>>>>>     HConnection connection = null;
> >> >>>>>>     HBaseAdmin hbaseAdmin = null;
> >> >>>>>>     try {
> >> >>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >> >>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> >> >>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >> >>>>>>         RegionScanner regionScanner = null;
> >> >>>>>>         for (HRegionInfo region : regions) {
> >> >>>>>>             startKey = region.getStartKey();
> >> >>>>>>             stopKey = region.getEndKey();
> >> >>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> >> >>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
> >> >>>>>>             if (regionScanner != null) {
> >> >>>>>>                 regionScanners.add(regionScanner);
> >> >>>>>>             }
> >> >>>>>>         }
> >> >>>>>>
> >> >>>>>> And I execute the RegionScanner with this:
> >> >>>>>>
> >> >>>>>> public List<Result> call() throws Exception {
> >> >>>>>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >> >>>>>>     HTableInterface table = connection.getTable(configuration.getTable());
> >> >>>>>>
> >> >>>>>>     Scan scan = new Scan(startKey, stopKey);
> >> >>>>>>     scan.setBatch(configuration.getBatch());
> >> >>>>>>     scan.setCaching(configuration.getCaching());
> >> >>>>>>     ResultScanner resultScanner = table.getScanner(scan);
> >> >>>>>>
> >> >>>>>>     List<Result> results = new ArrayList<Result>();
> >> >>>>>>     for (Result result : resultScanner) {
> >> >>>>>>         results.add(result);
> >> >>>>>>     }
> >> >>>>>>
> >> >>>>>>     // close scanner and table before the connection
> >> >>>>>>     resultScanner.close();
> >> >>>>>>     table.close();
> >> >>>>>>     connection.close();
> >> >>>>>>
> >> >>>>>>     return results;
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>> They implement Callable.
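A side note on the two knobs set in call() above, since they are easy to
conflate: setCaching is the number of rows the scanner transfers per RPC
round trip, while setBatch caps the number of cells returned in each Result
(it splits wide rows; it never groups rows). With the one-column rows
described later in the thread, batch should have no visible effect; caching
is the parameter that matters here. Illustrative values:

    Scan scan = new Scan(startKey, stopKey);
    scan.setCaching(1000); // rows per RPC: fewer round trips, more client memory
    scan.setBatch(10);     // max cells per Result; only visible on wide rows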
> >> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> >> >>>>>>
> >> >>>>>>> Let's take a step back…
> >> >>>>>>>
> >> >>>>>>> Your parallel scan is having the client create N threads,
> >> >>>>>>> where in each thread you're doing a partial scan of the
> >> >>>>>>> table, and each partial scan takes the first and last row of
> >> >>>>>>> each region?
> >> >>>>>>>
> >> >>>>>>> Is that correct?
> >> >>>>>>>
> >> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>>> I was checking a little bit more; I checked the cluster, and
> >> >>>>>>>> the data is stored in three different region servers, each
> >> >>>>>>>> one on a different node. So I guess the threads go to
> >> >>>>>>>> different hard disks.
> >> >>>>>>>>
> >> >>>>>>>> If someone has an idea or suggestion... why is a single scan
> >> >>>>>>>> faster than this implementation? I based it on this
> >> >>>>>>>> implementation:
> >> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
> >> >>>>>>>>
> >> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> >> >>>>>>>>
> >> >>>>>>>>> I'm working with HBase 0.94 for this case; I'll try with
> >> >>>>>>>>> 0.98, although there is no difference.
> >> >>>>>>>>> I disabled the table and turned off the block cache for
> >> >>>>>>>>> that family, and I set scan.setCacheBlocks(false) as well
> >> >>>>>>>>> in both cases.
> >> >>>>>>>>>
> >> >>>>>>>>> I think it's not possible that I'm executing a complete
> >> >>>>>>>>> scan in each thread, since my data is of the type:
> >> >>>>>>>>> 000001 f:q value=1
> >> >>>>>>>>> 000002 f:q value=2
> >> >>>>>>>>> 000003 f:q value=3
> >> >>>>>>>>> ...
> >> >>>>>>>>>
> >> >>>>>>>>> I add up all the values and get the same result from a
> >> >>>>>>>>> single scan as from the distributed one, so I guess the
> >> >>>>>>>>> DistributedScan did well.
> >> >>>>>>>>> The count from the HBase shell takes about 10-15 seconds,
> >> >>>>>>>>> I don't remember exactly, but something like 4x the scan
> >> >>>>>>>>> time.
> >> >>>>>>>>> I'm not using any filter for the scans.
> >> >>>>>>>>>
> >> >>>>>>>>> This is the way I calculate the number of regions/scans:
> >> >>>>>>>>>
> >> >>>>>>>>> private List<RegionScanner> generatePartitions() {
> >> >>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> >> >>>>>>>>>     byte[] startKey;
> >> >>>>>>>>>     byte[] stopKey;
> >> >>>>>>>>>     HConnection connection = null;
> >> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
> >> >>>>>>>>>     try {
> >> >>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> >> >>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >> >>>>>>>>>         RegionScanner regionScanner = null;
> >> >>>>>>>>>         for (HRegionInfo region : regions) {
> >> >>>>>>>>>             startKey = region.getStartKey();
> >> >>>>>>>>>             stopKey = region.getEndKey();
> >> >>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> >> >>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
> >> >>>>>>>>>             if (regionScanner != null) {
> >> >>>>>>>>>                 regionScanners.add(regionScanner);
> >> >>>>>>>>>             }
> >> >>>>>>>>>         }
> >> >>>>>>>>>
> >> >>>>>>>>> I did some tests on a tiny table and I think the range for
> >> >>>>>>>>> each scan works fine. Although I thought it was interesting
> >> >>>>>>>>> that the distributed scan takes about 6x the time.
> >> >>>>>>>>>
> >> >>>>>>>>> I'm going to check the hard disks, but I think they're
> >> >>>>>>>>> fine.
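Lars asked at the top of the thread for code to seed the data. Given the row
shape described above (zero-padded keys, one column f:q holding the row
number), a minimal seeding sketch might look like the following; the table
name "test" and the single batched put are assumptions, not the poster's
code:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Seed 100,000 rows of the form: 000001 f:q value=1
    HTableInterface table = connection.getTable("test"); // assumed table name
    List<Put> puts = new ArrayList<Put>();
    for (int i = 1; i <= 100000; i++) {
        Put put = new Put(Bytes.toBytes(String.format("%06d", i)));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"),
                Bytes.toBytes(String.valueOf(i)));
        puts.add(put);
    }
    table.put(puts); // one batched round of puts
    table.close();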
> >> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl:
> >> >>>>>>>>>
> >> >>>>>>>>>> Which version of HBase?
> >> >>>>>>>>>> Can you show us the code?
> >> >>>>>>>>>>
> >> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as
> >> >>>>>>>>>> long as the single scan, which is suspicious because you
> >> >>>>>>>>>> say you have 6 regions.
> >> >>>>>>>>>> Are you sure you're not accidentally scanning all the
> >> >>>>>>>>>> data in each of your parallel scans?
> >> >>>>>>>>>>
> >> >>>>>>>>>> -- Lars
> >> >>>>>>>>>>
> >> >>>>>>>>>> ________________________________
> >> >>>>>>>>>> From: Guillermo Ortiz
> >> >>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
> >> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> >> >>>>>>>>>> Subject: Scan vs Parallel scan.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Hi,
> >> >>>>>>>>>>
> >> >>>>>>>>>> I developed a distributed scan; I create a thread for
> >> >>>>>>>>>> each region. After that, I've tried to time Scan vs
> >> >>>>>>>>>> DistributedScan.
> >> >>>>>>>>>> I have disabled the block cache on my table. My cluster
> >> >>>>>>>>>> has 3 region servers with 2 regions each; in total there
> >> >>>>>>>>>> are 100,000 rows, and I execute a complete scan.
> >> >>>>>>>>>>
> >> >>>>>>>>>> My partitions are:
> >> >>>>>>>>>>       -016666 -> request 16665
> >> >>>>>>>>>> 016666-033332 -> request 16666
> >> >>>>>>>>>> 033332-049998 -> request 16666
> >> >>>>>>>>>> 049998-066664 -> request 16666
> >> >>>>>>>>>> 066664-083330 -> request 16666
> >> >>>>>>>>>> 083330-       -> request 16671
> >> >>>>>>>>>>
> >> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
> >> >>>>>>>>>>
> >> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
> >> >>>>>>>>>>
> >> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
> >> >>>>>>>>>>
> >> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
> >> >>>>>>>>>>
> >> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
> >> >>>>>>>>>>
> >> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
> >> >>>>>>>>>>
> >> >>>>>>>>>> The parallel scan works much worse than the simple scan,
> >> >>>>>>>>>> and I don't know why the simple one is so fast; it's
> >> >>>>>>>>>> really much faster than executing a "count" from the
> >> >>>>>>>>>> HBase shell, which doesn't look quite normal. The only
> >> >>>>>>>>>> time the parallel version works better is against a
> >> >>>>>>>>>> normal scan with caching 1.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Any clue about it?
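A rough sanity check on those numbers, assuming caching is simply the number
of rows returned per scanner RPC and ignoring the per-region open/close
calls: 100,000 rows at caching 1 is on the order of 100,000 round trips,
versus roughly 1,000 at caching 100, which by itself accounts for the
68288 ms vs 2646 ms gap between the two NORMAL runs. It does not explain why
the PARALLEL runs sit at 16-22 seconds regardless of caching; overhead that
flat across caching values suggests a fixed per-thread cost, for example the
new HConnection created inside each call() quoted above, though that is a
guess, not a measurement.

    // Back-of-the-envelope RPC counts for the 100,000-row scan above.
    long rows = 100000;
    for (int caching : new int[] {1, 10, 100, 1000}) {
        System.out.println("caching " + caching + " -> ~" + (rows / caching) + " RPCs");
    }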