hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: HBase - Secondary Index
Date Wed, 09 Jan 2013 00:30:34 GMT
Different use cases.


For global point queries you want exactly what you said below.
For range scans across many rows you want Anoop's design. As usual, it depends.


The tradeoff is bringing a lot of unnecessary data to the client vs having to contact each
region (or at least each region server).
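
To make the tradeoff concrete, here is a toy model in plain Python (no HBase client involved). The region layout, the data, and every function name are assumptions made up for illustration, and the counts ignore the request to the index table itself. It only sketches the two costs named above: index entries pulled to the client versus regions contacted.

```python
from typing import Dict, List, Tuple

# Toy cluster: region name -> {row key: indexed column value}.
# All names and data are hypothetical.
REGIONS: Dict[str, Dict[str, str]] = {
    "region-1": {"a1": "red", "a2": "blue"},
    "region-2": {"b1": "red", "b2": "green"},
    "region-3": {"c1": "blue", "c2": "green"},
}


def build_global_index(regions: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """One index for the whole cluster: value -> sorted row keys."""
    index: Dict[str, List[str]] = {}
    for rows in regions.values():
        for key, value in rows.items():
            index.setdefault(value, []).append(key)
    return {value: sorted(keys) for value, keys in index.items()}


def query_global(regions, global_index, value) -> Tuple[List[str], Dict[str, int]]:
    """Client reads the index entries, then visits only regions holding matches."""
    keys = global_index.get(value, [])
    touched = {name for name, rows in regions.items() if any(k in rows for k in keys)}
    cost = {"index_entries_to_client": len(keys), "regions_contacted": len(touched)}
    return keys, cost


def query_region_local(regions, value) -> Tuple[List[str], Dict[str, int]]:
    """Every region consults its own index; no index entries reach the client."""
    keys: List[str] = []
    for rows in regions.values():
        keys.extend(k for k, v in rows.items() if v == value)
    cost = {"index_entries_to_client": 0, "regions_contacted": len(regions)}
    return sorted(keys), cost
```

In this toy, a lookup for "red" through the global index ships two index entries to the client and touches two regions, while the region-local variant ships none but has to ask all three regions.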


-- Lars



________________________________
 From: Michael Segel <michael_segel@hotmail.com>
To: user@hbase.apache.org 
Sent: Tuesday, January 8, 2013 6:33 AM
Subject: Re: HBase - Secondary Index
 
So if you're using an inverted table / index why on earth are you doing it at the region level?


I tried to explain this to others over 6 months ago, and it's not really a good idea.

You're overcomplicating this, and you will end up creating performance bottlenecks when your
secondary index is completely orthogonal to your row key.

To give you an example... 

Suppose you're CCCIS and you have a large database of auto insurance claims that you've acquired
over the years from your Pathways product. 

Your primary key would be a combination of the Insurance Company's ID and their internal claim
ID for the individual claim. 
Your row would be all of the data associated with that claim.

So now let's say you want to find the average cost to repair a front-end collision of a
Volvo S80.
The make and model of the car would be orthogonal to the initial key. This means that the
result set containing insurance records for front-end collisions of Volvo S80s would most
likely be evenly distributed across the cluster's regions.

If you used a series of inverted tables, you would be able to use a series of get()s to get
the result set from each index and then find their intersections. (Note that you could also
put them in sort order so that the intersections would be fairly straightforward to find.)
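
A minimal sketch of that idea in plain Python, with dicts standing in for the inverted tables. The table contents, the row key format, and the function names are all hypothetical. Each lookup plays the role of one get() against an index table, and because each index keeps its row keys sorted, the intersection is a single merge pass.

```python
from typing import Dict, List

# Hypothetical inverted tables: indexed value -> sorted list of main-table
# row keys (insurer ID plus claim ID). Dicts stand in for HBase index tables.
MAKE_MODEL_INDEX: Dict[str, List[str]] = {
    "Volvo/S80": ["insA|0007", "insA|0042", "insB|0003", "insC|0019"],
    "Volvo/V70": ["insA|0011", "insB|0008"],
}
DAMAGE_INDEX: Dict[str, List[str]] = {
    "front-end": ["insA|0011", "insA|0042", "insB|0003", "insC|0021"],
    "rear-end": ["insA|0007", "insC|0019"],
}


def index_get(index: Dict[str, List[str]], value: str) -> List[str]:
    """Stand-in for a single get() against one index table."""
    return index.get(value, [])


def intersect_sorted(left: List[str], right: List[str]) -> List[str]:
    """Merge-style intersection of two sorted row-key lists."""
    result: List[str] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def claims_matching(make_model: str, damage: str) -> List[str]:
    """Row keys of claims that appear in both index results."""
    return intersect_sorted(
        index_get(MAKE_MODEL_INDEX, make_model),
        index_get(DAMAGE_INDEX, damage),
    )
```

With the sample data, `claims_matching("Volvo/S80", "front-end")` yields the two row keys present in both lists; those keys would then be fetched from the main table to compute the average repair cost.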


Doing this at the region level isn't so simple. 

So I have to ask again: why go through all this and overcomplicate things?

Just saying... 

On Jan 7, 2013, at 7:49 AM, Anoop Sam John <anoopsj@huawei.com> wrote:

> Hi,
> It is an inverted index based on column value(s).
> The indexing is region-wise. It works whether or not one knows the rowkey range.
> 
> -Anoop-
> ________________________________________
> From: Mohit Anchlia [mohitanchlia@gmail.com]
> Sent: Monday, January 07, 2013 9:47 AM
> To: user@hbase.apache.org
> Subject: Re: HBase - Secondary Index
> 
> Hi Anoop,
> 
> Am I correct in understanding that this indexing mechanism is only
> applicable when you know the row key? It's not an inverted index truly
> based on the column value.
> 
> Mohit
> On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <anoopsj@huawei.com> wrote:
> 
>> Hi Adrien
>>                 We maintain consistency between the main table and the
>> index table, and handle the rollback mentioned below, using the CP
>> (coprocessor) hooks. The current hooks were not enough for that, though.
>> I am in the process of contributing those new hooks, core changes, etc.
>> Once all of that is done I will be able to explain in detail.
>> 
>> -Anoop-
>> ________________________________________
>> From: Adrien Mogenet [adrien.mogenet@gmail.com]
>> Sent: Monday, January 07, 2013 2:00 AM
>> To: user@hbase.apache.org
>> Subject: Re: HBase - Secondary Index
>> 
>> Nice topic, perhaps one of the most important for 2013 :-)
>> I still don't get how you're ensuring consistency between index table and
>> main table, without an external component (such as bookkeeper/zookeeper).
>> What's the exact write path in your situation when inserting data ?
>> (WAL/RegionObserver, pre/post put/WALedit...)
>> 
>> The underlying question is how you're ensuring that the WALEdits in the Index
>> and Main tables are perfectly in sync, and how you're able to roll back in
>> case of an issue in either WAL.
>> 
>> 
>> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <kelvin.msj@gmail.com>
>> wrote:
>> 
>>>> Yes as you say when the no of rows to be returned is becoming more and
>>>> more the latency will be becoming more.  seeks within an HFile block is
>>>> some what expensive op now. (Not much but still)  The new encoding prefix
>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted also
>>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
>>>> measure the scan performance with this new encoding . Trying to back port
>>>> a simple patch for 94 version just for testing...   Yes when the no of
>>>> results to be returned is more and more any index will become less
>>>> performing as per my study  :)
>>> 
>>> yes, you are right, I guess it's just a drawback of any index approach.
>>> Thanks for the explanation.
>>> 
>>> Shengjie
>>> 
>>> On 28 December 2012 04:14, Anoop Sam John <anoopsj@huawei.com> wrote:
>>> 
>>>>> Do you have link to that presentation?
>>>> 
>>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
>>>> 
>>>> -Anoop-
>>>> 
>>>> ________________________________________
>>>> From: Mohit Anchlia [mohitanchlia@gmail.com]
>>>> Sent: Friday, December 28, 2012 9:12 AM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: HBase - Secondary Index
>>>> 
>>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <anoopsj@huawei.com>
>>>> wrote:
>>>> 
>>>>> Yes as you say when the no of rows to be returned is becoming more and
>>>>> more the latency will be becoming more.  seeks within an HFile block is
>>>>> some what expensive op now. (Not much but still)  The new encoding prefix
>>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted also
>>>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
>>>>> measure the scan performance with this new encoding . Trying to back port a
>>>>> simple patch for 94 version just for testing...   Yes when the no of
>>>>> results to be returned is more and more any index will become less
>>>>> performing as per my study  :)
>>>>> 
>>>>> Do you have link to that presentation?
>>>> 
>>>> 
>>>>>> btw, quick question- in your presentation, the scale there is seconds or
>>>>>> milliseconds:)
>>>>>
>>>>> It is seconds.  Don't consider the exact values. What matters is the %
>>>>> increase in latency :) Those were not high end machines.
>>>>> 
>>>>> -Anoop-
>>>>> ________________________________________
>>>>> From: Shengjie Min [kelvin.msj@gmail.com]
>>>>> Sent: Thursday, December 27, 2012 9:59 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: HBase - Secondary Index
>>>>> 
>>>>>> Didn't follow you completely here. There won't be any get() happening.
>>>>>> Since we get the exact rowkey in a region from the index table, we can
>>>>>> seek to the exact position and return that row.
>>>>> 
>>>>> Sorry, I misused "get()" here; I meant seeking. Yes, if it's just a
>>>>> small number of rows returned, this works perfectly. As you said, you will
>>>>> get the exact rowkey positions per region, and simply seek to them. I was
>>>>> trying to work out the case where the number of result rows increases
>>>>> massively. Like in Anil's case, he wants to do a scan query against the
>>>>> secondary index (timestamp): "select all rows from timestamp1 to
>>>>> timestamp2", given no customerId provided. During that time period, he
>>>>> might have a big chunk of rows from different customerIds. The index table
>>>>> returns a lot of rowkey positions for different customerIds (I believe
>>>>> they are scattered across different regions), then you end up seeking to
>>>>> all the different positions in different regions and returning all the
>>>>> rows needed. According to your presentation, page 14 - Performance Test
>>>>> Results (Scan), without an index it's a linear increase as the number of
>>>>> result rows increases. On the other hand, with an index, the time spent
>>>>> climbs way quicker than in the case without an index.
>>>>>
>>>>> btw, quick question - in your presentation, is the scale there seconds or
>>>>> milliseconds? :)
>>>>> 
>>>>> - Shengjie
>>>>> 
>>>>> 
>>>>> On 27 December 2012 15:54, Anoop John <anoop.hbase@gmail.com> wrote:
>>>>> 
>>>>>>> how the massive number of get() is going to
>>>>>>> perform against the main table
>>>>>>
>>>>>> Didn't follow you completely here. There won't be any get() happening.
>>>>>> Since we get the exact rowkey in a region from the index table, we can
>>>>>> seek to the exact position and return that row.
>>>>>> 
>>>>>> -Anoop-
>>>>>> 
>>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <kelvin.msj@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> how the massive number of get() is going to
>>>>>>> perform against the main table
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> All the best,
>>>>> Shengjie Min
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> All the best,
>>> Shengjie Min
>>> 
>> 
>> 
>> 
>> --
>> Adrien Mogenet
>> 06.59.16.64.22
>> http://www.mogenet.me
>> 