hbase-user mailing list archives

From Anoop Sam John <anoo...@huawei.com>
Subject RE: HBase - Secondary Index
Date Tue, 18 Dec 2012 09:35:44 GMT
Hi Mike
>My question is that since you don't have any formal SQL syntax, how are you doing this
all server side?
I think the question is for Anil. In his case he is not doing the index data scan on the
server side. He scans the index table, brings the results back to the client, and from the
client does gets to fetch the main table data. Correct, Anil?
Just making it clear... :)
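
For readers following the thread, the client-side approach described here (scan the index table, then batch get from the main table) can be sketched roughly as below. This is a plain-Python simulation, not the HBase client API: the tables are in-memory structures, and the fixed-width keys, sample rows, and the names scan_prefix, batch_get and query_by_timestamp are all assumptions made for illustration.

```python
from bisect import bisect_left

# Toy main table A: rowkey = customer_id + event_id (assumed fixed widths).
main_table = {
    "c001e01": "login",
    "c001e02": "purchase",
    "c002e01": "login",
}

# Toy index table B, kept sorted by rowkey like an HBase table:
# (event_timestamp + customer_id, rowkey of A).
index_table = sorted([
    ("20121201c001", "c001e01"),
    ("20121201c002", "c002e01"),
    ("20121202c001", "c001e02"),
])

def scan_prefix(rows, prefix):
    """Stand-in for a scan with startRow plus a prefix filter."""
    i = bisect_left(rows, (prefix,))
    while i < len(rows) and rows[i][0].startswith(prefix):
        yield rows[i]
        i += 1

def batch_get(table, rowkeys):
    """Stand-in for a batch get on the main table."""
    return [(k, table[k]) for k in rowkeys if k in table]

def query_by_timestamp(prefix):
    # Round trip 1: read the main-table rowkeys out of the index table.
    main_keys = [main_key for _, main_key in scan_prefix(index_table, prefix)]
    # Round trip 2: fetch the actual rows from the main table.
    return batch_get(main_table, main_keys)

print(query_by_timestamp("20121201"))
```

The two calls inside query_by_timestamp correspond to the two RPC round trips discussed further down the thread.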

-Anoop-
________________________________________
From: Michel Segel [michael_segel@hotmail.com]
Sent: Tuesday, December 18, 2012 2:32 PM
To: user@hbase.apache.org
Cc: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Just a couple of questions...

First, since you don't have any natural secondary indices, you can create one from a couple
of choices. Keeping it simple, you choose an inverted table as your index.

In doing so, you have one column containing all of the row ids for a given value.
This means that it is a simple get().
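
A minimal sketch of the inverted-table idea, assuming an in-memory dict in place of a real HBase table. The names index_put and index_get and the sample row ids are hypothetical; the point is only that a lookup by value becomes a single keyed read.

```python
from collections import defaultdict

# Toy inverted index: one entry per indexed value, holding every main-table
# row id that carries that value (standing in for the single column above).
inverted_index = defaultdict(set)

def index_put(value, row_id):
    """Record that the main-table row row_id carries this value."""
    inverted_index[value].add(row_id)

def index_get(value):
    """Single keyed lookup; returns an empty list for unknown values."""
    return sorted(inverted_index.get(value, ()))

index_put("20121201", "c001e01")
index_put("20121201", "c002e01")
index_put("20121202", "c001e02")

print(index_get("20121201"))
```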

My question is that since you don't have any formal SQL syntax, how are you doing this all
server side?


Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 18, 2012, at 2:28 AM, anil gupta <anilgupta84@gmail.com> wrote:

> Hi Anoop,
>
> Please find my reply inline.
>
> Thanks,
> Anil Gupta
>
> On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <anoopsj@huawei.com> wrote:
>
>> Hi Anil
>>                During the scan, there is no need to fetch any index data
>> to client side. So there is no need to create any scanner on the index
>> table at the client side. This happens at the server side.
>
>
>>
>> For the Scan on the main table with conditions on timestamp and customer
>> id, a scanner is created with Filters, just as you would when there is no
>> secondary index. So this scan from the client will go through all the
>> regions in the main table.
>
>
> Anil: Do you mean that if the table is spread across 50 region servers in a
> 60 node cluster, then we need to send a scan request to all 50 RS?
> Doesn't that sound expensive? IMHO you were not doing this in your
> solution. Your solution looked cleaner than this, since you knew exactly
> which node to go to when querying through the secondary index, thanks to
> the co-location of the primary table region and the secondary index table
> region (due to the static begin part of the secondary table rowkey). My
> problem is a little more complicated because of one constraint: I cannot
> have a "static begin part" in the rowkey of my secondary table.
>
>> When it scans one particular region, say [x,y), on the main table, using the
>> CP we can get the index table region object corresponding to this main
>> table region from the RS. There is no issue in creating the static part of
>> the rowkey: you know 'x' is the region start key. Then the server side
>> will create a scanner on the index region directly, and here we can specify
>> the startkey as 'x' + <timestamp value> + <customer id>. Using the results
>> from the index scan we will reseek on the main region to the exact
>> rows where the data we are interested in is available. So there won't
>> be a full region data scan happening.
>
>> In cases where only the timestamp is given and there is no customer id, it
>> is simple again. Create a scanner on the main table with only one
>> filter. On the CP side the scanner on the index region will get created
>> with startkey as 'x' + <timestamp value>. When you create the scan
>> object and set startRow on it, it need not be the full rowkey. It can be
>> just part of the rowkey, like a prefix.
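
The per-region, server-side lookup described above might look roughly like this. It is a plain-Python simulation under stated assumptions, not coprocessor code: the index region is a sorted list of keys, the main region is a dict, field widths are fixed (TS_LEN, CUST_LEN), and scan_region plus the sample data are hypothetical.

```python
from bisect import bisect_left

TS_LEN, CUST_LEN = 8, 4  # assumed fixed field widths

def scan_region(region_start, index_region, main_region, timestamp, customer_id=""):
    """Simulate the lookup for one region: prefix scan on the index region,
    then a direct seek into the main region for each hit."""
    # startkey = 'x' + <timestamp value> [+ <customer id>]; a partial key works.
    prefix = region_start + timestamp + customer_id
    i = bisect_left(index_region, prefix)
    hits = []
    while i < len(index_region) and index_region[i].startswith(prefix):
        # The main-table rowkey is stored as the tail of the index rowkey.
        main_rowkey = index_region[i][len(region_start) + TS_LEN + CUST_LEN:]
        hits.append((main_rowkey, main_region[main_rowkey]))
        i += 1
    return hits

region_start = "c000"  # 'x', the start key shared by main and index region
main_region = {"c001e01": "login", "c001e02": "purchase", "c002e01": "login"}
index_region = sorted([
    region_start + "20121201" + "c001" + "c001e01",
    region_start + "20121201" + "c002" + "c002e01",
    region_start + "20121202" + "c001" + "c001e02",
])

print(scan_region(region_start, index_region, main_region, "20121201"))
print(scan_region(region_start, index_region, main_region, "20121201", "c002"))
```

Only the matching index entries are visited, and each one points straight at a main-table row, so nothing resembling a full region scan is needed in this model.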
>>
>> Hope u got it now :)
> Anil: I hope now we are on same page. Thanks a lot for your valuable time
> to discuss this stuff.
>
>>
>> -Anoop-
>> ________________________________________
>> From: anil gupta [anilgupta84@gmail.com]
>> Sent: Friday, December 14, 2012 11:31 PM
>> To: user@hbase.apache.org
>> Subject: Re: HBase - Secondary Index
>>
>> On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <anoopsj@huawei.com>
>> wrote:
>>
>>> Hi Anil,
>>>
>>>> 1. In your presentation you mentioned that region of Primary Table and
>>>> Region of Secondary Table are always located on the same region server. How
>>>> do you achieve it? By using the Primary table rowkey as prefix of Rowkey
>>>> of Secondary Table? Will your implementation work if the rowkey of primary
>>>> table cannot be used as prefix in rowkey of Secondary table (I have this
>>>> limitation in my use case)?
>>> First of all, there will be the same number of regions in both the primary
>>> and index tables. All the start/stop keys of the regions will also be the same.
>>> Suppose there are 2 regions on the main table, say for keys 0-10 and 10-20.
>>> Then we will create 2 regions in the index table with the same key ranges.
>>> At the master balancing level it is easy to collocate these regions by
>>> looking at the start and end keys.
>>> The selection of the rowkey that will be used in the index table is
>>> the key here.
>>> What we do is prefix all the rowkeys in the index table
>>> with the start key of the region.
>>> When an entry is added to the main table with rowkey 5, it will go to
>>> the 1st region (0-10).
>>> Now there will be an index region with range 0-10. We will select this
>>> region to store this index data.
>>> The row getting added into the index region for this entry will have the
>>> rowkey 0_x_5.
>>> I am using '_' as a separator here just to show this. Actually we
>>> won't have any separator.
>>> So the rowkeys (in the index region) will always have a static begin part.
>>> At scan time we also know this part, so creating the startrow and endrow
>>> for the scan is possible. Note that we store the actual
>>> table row key as the last part of the index rowkey itself, not as a value.
>>> This is the better option in our case, since the scan's index usage is also
>>> handled on the server side. There is no index data fetch to the client side.
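
The rowkey scheme above can be illustrated with a small simulation. This is a sketch under assumptions, not the Huawei implementation: keys are zero-padded strings, the two regions start at "00" and "10", 'x' is taken to be the indexed column value, and region_start_for and index_rowkey are hypothetical helper names.

```python
from bisect import bisect_right

# Assumed region start keys, shared by the main table and the index table:
# regions [00,10) and [10,20), written as zero-padded strings.
REGION_START_KEYS = ["00", "10"]

def region_start_for(rowkey):
    """Start key of the region whose key range contains rowkey."""
    i = bisect_right(REGION_START_KEYS, rowkey) - 1
    return REGION_START_KEYS[i]

def index_rowkey(main_rowkey, indexed_value):
    """Index rowkey = region start key + indexed value + main rowkey.
    No separator is used, matching the description above."""
    return region_start_for(main_rowkey) + indexed_value + main_rowkey

key = index_rowkey("05", "x")
print(key)                    # the 0_x_5 example, without separators
print(region_start_for(key))  # the index row sorts into the same key range
```

Because the index rowkey begins with the main region's start key, it sorts into the index region with the same key range, which is the property the custom balancer relies on to keep the two regions together.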
>>
>> Anil: My primary table rowkey is customerId+event_id, and my secondary
>> table rowkey is timestamp+customerId. In your implementation it seems that,
>> to use the secondary index, the application needs to know the
>> "start_key" of the region (static begin part) it wants to query. Right? Do
>> you separately manage the logic of determining the region
>> "start_key" (static begin part) for a scan?
>> Also, it's possible that the customerId is not provided when using the
>> secondary index. So I won't have the customer id for all queries. Hence I
>> cannot use customer_id as a prefix in the rowkey of my Secondary Table.
>>
>>>
>>> I feel your use case perfectly fit with our model
>> Anil: Somehow I am unable to fit your implementation into my use case due
>> to the constraint of the static begin part of the rowkey in the Secondary
>> table. There seems to be a disconnect. Can you tell me how my use case fits
>> into your implementation?
>>
>>>
>>>> 2. Are you using an Endpoint or Observer for building the secondary index
>>>> table?
>>> Observer
>>>
>>>> 3. "Custom balancer do collocation". Is it a custom load balancer of HBase
>>>> Master or something else?
>>> It is a balancer implementation which will be plugged into the Master.
>>>
>>>> 4. Your region split looks interesting. I don't have much info about it. Can
>>>> you point to some docs on IndexHalfStoreFileReader?
>>> Sorry, I am not able to publish any design doc or code, as the company has
>>> not yet decided to open source the solution.
>>> If you come across any particular query, please feel free to ask me :)
>>> You can look at the HalfStoreFileReader class first.
>>>
>>> -Anoop-
>>> ________________________________________
>>> From: anil gupta [anilgupta84@gmail.com]
>>> Sent: Friday, December 14, 2012 2:11 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: HBase - Secondary Index
>>>
>>> Hi Anoop,
>>>
>>> Nice presentation, and it seems like a smart implementation. Since the
>>> presentation only covered bullet points, I have a couple of questions on
>>> your implementation. :)
>>>
>>> Here is a recap to my implementation and our previous discussion on
>>> Secondary index:
>>>
>>> Here is the link to previous email thread:
>>> http://search-hadoop.com/m/1zWPMaaRtr .
>>>
>>> The secondary index is stored in table "B" as rowkey B --> family:<rowkey
>>> A>  . "<rowkey A>" is the column qualifier. Every row in B will only
>>> have one column "k" and the value of that column is the rowkey of A.
>>>
>>> Suppose I am storing customer events in table A. I have two requirements for
>>> data queries:
>>> 1. Query customer events on basis of customer_Id and event_ID.
>>> 2. Query customer events on basis of event_timestamp and customer_ID.
>>>
>>> 70% of querying is done by query#1, so I will create
>>> <customer_Id><event_ID> as row key of Table A.
>>> Now, in order to support fast results for query#2, I need to create a
>>> secondary index on A. I store that secondary index in B; the rowkey of B is
>>> <event_timestamp><customer_ID>. Every row stores the corresponding rowkey of
>>> A.
>>>
>>> HBase Querying approach:
>>> 1. Scan the secondary table by using prefix filter and startRow to get the
>>> list of Rowkeys of Primary table.
>>> 2. Do a batch get on primary table by using HTable.get(List<Get>) method
>>> using the list of Rowkeys obtained in step1.
>>>
>>> The only issue is that my solution needs at least two RPC calls, one
>>> each in step1 and step2 above. I want to reduce the number of RPCs to 1 if
>>> possible.
>>>
>>>
>>> ******Questions on your implementation:*********
>>>
>>> 1. In your presentation you mentioned that region of Primary Table and
>>> Region of Secondary Table are always located on the same region server. How
>>> do you achieve it? By using the Primary table rowkey as prefix of Rowkey
>>> of Secondary Table? Will your implementation work if the rowkey of primary
>>> table cannot be used as prefix in rowkey of Secondary table (I have this
>>> limitation in my use case)?
>>> 2. Are you using an Endpoint or Observer for building the secondary index
>>> table?
>>> 3. "Custom balancer do collocation". Is it a custom load balancer of HBase
>>> Master or something else?
>>> 4. Your region split looks interesting. I don't have much info about it. Can
>>> you point to some docs on IndexHalfStoreFileReader?
>>>
>>> Thanks,
>>> Anil Gupta
>>>
>>>
>>>
>>> On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <anoopsj@huawei.com>
>>> wrote:
>>>
>>>> Hi All
>>>>
>>>> Last week I got a chance to present the secondary indexing
>>>> solution that we have built at Huawei at the China Hadoop Conference. You
>>>> can see the presentation at
>>>> http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
>>>>
>>>>
>>>>
>>>> I would like to hear what others think on this. :)
>>>>
>>>>
>>>>
>>>> -Anoop-
>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>
>
>
> --
> Thanks & Regards,
> Anil Gupta