Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Mon, 22 Sep 2014 14:27:34 +0000 (UTC)
From: "Niels Basjes (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12741724.1410859108000.93885.1411396054681@Atlassian.JIRA>
In-Reply-To: <JIRA.12741724.1410859108000@Atlassian.JIRA>
References: <JIRA.12741724.1410859108000@Atlassian.JIRA>
 <JIRA.12741724.1410859108804@arcas>
Subject: [jira] [Commented] (HBASE-11990) Make setting the start and stop
 row for a specific prefix easier
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143245#comment-14143245 ] 

Niels Basjes commented on HBASE-11990:
--------------------------------------

I had a hunch that the solution direction [~lhofhansl] indicated had a negative performance implication.
My hunch:
# If I do a single threaded scan then
#* I expect the system to start at the region where the startRow lives and continue (from region to region) until the PrefixFilter says to stop.
# If I do a multi threaded scan (i.e. MapReduce job) then
#* I expect the system to create a mapper for all regions that include or are 'after' the startRow and 'before' the stopRow (which is the end of the table). 
#* Then at each region that is 'too far' the first row will cause the PrefixFilter to say "this is enough, no more".

So I tested this with a table that was presplit into 100 regions (splits at '01', '02', ..., '98', '99').
I put in a single row using the shell:
{code}put 'hbaseScanTest', '10',  'F:qualfier', 'value'{code}

When I scan for the prefix '10' I see 1 input record arriving at the mappers in both implementations.
Yet there is a difference in the HBase Counters:
* *"startRow/stopRow":  REGIONS_SCANNED = 1*
* *"startRow/PrefixFilter": REGIONS_SCANNED = 90*

So to me it seems my hunch was correct.
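
The two counter values can be reproduced with a simplified model of which regions a job has to touch. This is only a sketch, not the actual HBase input format code: the class and method names are hypothetical, row keys are modelled as ASCII strings, and an empty string stands for an open boundary (start or end of the table).

```java
// Simplified, hypothetical model: counts the regions whose key range overlaps a scan range.
// It does not call any HBase API; row keys are plain ASCII strings for readability.
final class RegionOverlapSketch {

    // Builds the 99 split points "01" .. "99", which yield 100 regions.
    static String[] twoDigitSplits() {
        String[] splits = new String[99];
        for (int i = 1; i <= 99; i++) {
            splits[i - 1] = "" + (char) ('0' + i / 10) + (char) ('0' + i % 10);
        }
        return splits;
    }

    // Counts regions [regionStart, regionEnd) that overlap the scan [startRow, stopRow).
    // An empty stopRow or regionEnd means "until the end of the table".
    static int regionsTouched(String[] splits, String startRow, String stopRow) {
        int count = 0;
        for (int i = 0; i <= splits.length; i++) {
            String regionStart = (i == 0) ? "" : splits[i - 1];
            String regionEnd = (i == splits.length) ? "" : splits[i];
            boolean endsAfterScanStart = regionEnd.isEmpty() || regionEnd.compareTo(startRow) > 0;
            boolean startsBeforeScanStop = stopRow.isEmpty() || regionStart.compareTo(stopRow) < 0;
            if (endsAfterScanStart && startsBeforeScanStop) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] splits = twoDigitSplits();
        // startRow/stopRow: the scan range is ["10", "11").
        System.out.println(regionsTouched(splits, "10", "11"));
        // startRow/PrefixFilter: no stopRow, so the range runs to the end of the table.
        System.out.println(regionsTouched(splits, "10", ""));
    }
}
```

Under these assumptions the model gives 1 region for the startRow/stopRow variant and 90 regions for the startRow/PrefixFilter variant, matching the counters above.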
Now we could say that the PrefixFilter method should also set the stopRow ... but the only correct value you can put there that leaves no remaining overhead is the one the original implementation calculates.

At this moment I think we should go back to my original implementation by calculating the correct startRow and stopRow.
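
For reference, a minimal sketch of that stopRow calculation. It is not the code from the attached patches: the class and method names are hypothetical, and it assumes that an empty byte[] as stopRow means "scan until the end of the table".

```java
// Hypothetical helper: derives the exclusive stopRow for a row-key prefix.
final class PrefixStopRow {

    // Returns the smallest row key that sorts after every key starting with the prefix.
    // Returns an empty array when no such key exists (empty prefix or only 0xFF bytes),
    // which is assumed here to mean "no stopRow, scan to the end of the table".
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so they are dropped.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stopRow = java.util.Arrays.copyOf(prefix, end);
        // Increment the last remaining byte; it is not 0xFF, so it cannot wrap to 0x00.
        stopRow[end - 1]++;
        return stopRow;
    }

    public static void main(String[] args) {
        byte[] prefix = {0x12, 0x23, (byte) 0xFF, (byte) 0xFF};
        StringBuilder hex = new StringBuilder();
        for (byte b : stopRowForPrefix(prefix)) {
            hex.append(String.format("0x%02X ", b & 0xFF));
        }
        // Prints the stopRow for the example prefix from the issue description.
        System.out.println(hex.toString().trim());
    }
}
```

For the prefix { 0x12, 0x23, 0xFF, 0xFF } this yields { 0x12, 0x24 }; the startRow is simply the prefix itself.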
As there has already been a lot of discussion about this subject: Please advise.

> Make setting the start and stop row for a specific prefix easier
> ----------------------------------------------------------------
>
>                 Key: HBASE-11990
>                 URL: https://issues.apache.org/jira/browse/HBASE-11990
>             Project: HBase
>          Issue Type: New Feature
>          Components: Client
>            Reporter: Niels Basjes
>         Attachments: 11990v4.txt, HBASE-11990-20140916-v2.patch, HBASE-11990-20140916-v3.patch, HBASE-11990-20140916-v5.patch, HBASE-11990-20140916-v6.patch, HBASE-11990-20140916.patch, HBASE-11990-20140917-v7.patch, HBASE-11990-20140919-v8.patch, HBASE-11990-20140921-v9.patch
>
>
> If you want to set a scan from your application to scan for a specific row prefix this is actually quite hard.
> As described in several places you can set the startRow to the prefix; yet the stopRow should be set to the prefix '+1'.
> If the prefix is 'ASCII' put into a byte[] then this is easy because you can simply increment the last byte of the array.
> But if your application uses real binary rowids you may run into the scenario that your prefix is something like 
> {code}{ 0x12, 0x23, 0xFF, 0xFF }{code} Then the increment should be {code}{ 0x12, 0x24 }{code}
> I have prepared a proposed patch that makes setting these values correctly a lot easier.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)