hbase-user mailing list archives

From Marc Sturm <mas9...@nyp.org>
Subject RE: question about composite rowKey and performance difference between getScanner() and get(Get[])
Date Fri, 05 Dec 2014 16:07:01 GMT
I will read it. Thanks!
The size of the data that is not A or B uniqueIds is pretty small compared to the whole dataset,
so I think that points to the single-table solution.
Marc

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Thursday, December 04, 2014 1:12 PM
To: user@hbase.apache.org
Subject: Re: question about composite rowKey and performance difference between getScanner()
and get(Get[])

I assume you have read http://hbase.apache.org/book.html#schema.casestudies
(See 6.11.3)

What's the size of the data that is not A's or B's uniqueIds? The answer is related to the amount
of data redundancy that you are comfortable with in your design.

Cheers

On Wed, Dec 3, 2014 at 12:31 PM, Marc Sturm <mas9161@nyp.org> wrote:

> Hi,
>
> I have a many to many relationship that I am trying to model in hbase, 
> and I want to be sure I am not missing anything so please let me know 
> or point to the right documentation.
>
> Let's say I have an A to B many to many relationship, the query 
> parameter takes A unique id and returns all the B uniqueids related to 
> A with their properties and values.
>
> The first solution I found is having two tables: one with the rowKey 
> equal to A's unique id, the table column identifiers are equal to B's 
> unique ids related to A, the second table has its rowKeys equal to B 
> unique ids and its columns contain the property values. So the query 
> is two steps, it first does a get on A to collect all the B uniqueIds 
> and then does a second get on the B passing as a parameter an array of 
> B rowkeys. When I run the second query, the first call can have a much 
> higher latency, and subsequent calls with the same parameters then have 
> good, low latency. I believe that's a caching issue...
>
> The second solution is having one table with a composite rowkey equal 
> to A uniqueid + B uniqueid, I will then have duplicate B uniqueid 
> rows. But when I do a scan on just the first part of the rowKey (A 
> uniqueid), the response time and latency are more consistent and better (smaller).
>
> So, my questions are threefold: 1) which way is the best, 2) what is 
> the performance difference between a scan and a get with multiple 
> rowkeys (I think scan is faster because the data is not, or is less, 
> "distributed") and 3) how can we make the get with multiple rowkeys more consistent?
>
> Thank you for your help,
> Marc
>
> This electronic message is intended to be for the use only of the 
> named recipient, and may contain information that is confidential or privileged.
> If you are not the intended recipient, you are hereby notified that 
> any disclosure, copying, distribution or use of the contents of this 
> message is strictly prohibited.  If you have received this message in 
> error or are not the named recipient, please notify us immediately by 
> contacting the sender at the electronic mail address noted above, and 
> delete and destroy all copies of this message.  Thank you.
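
For readers comparing the two layouts discussed in the quoted message, here is a minimal in-memory sketch in Java. It does not use the HBase client API: a TreeMap stands in for a table whose row keys are kept in sorted order. The class name, the '|' separator, and the sample ids and values are all assumptions made for illustration. It only shows why a scan on the A-uniqueid prefix reads one contiguous key range, while the two-table design needs a lookup on A followed by one lookup per B rowkey.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CompositeKeyDemo {
    // Hypothetical separator, assumed never to appear inside an id.
    static final char SEP = '|';

    // Single-table layout: rowkey = A uniqueid + SEP + B uniqueid, kept sorted.
    static NavigableMap<String, String> compositeTable() {
        NavigableMap<String, String> t = new TreeMap<>();
        t.put("a1" + SEP + "b1", "props of b1");
        t.put("a1" + SEP + "b7", "props of b7");
        t.put("a2" + SEP + "b1", "props of b1"); // B row duplicated under a2
        t.put("a2" + SEP + "b3", "props of b3");
        return t;
    }

    // Prefix scan: every row for one A id lies in a single contiguous range.
    static List<String> scanByA(NavigableMap<String, String> table, String aId) {
        String start = aId + SEP;               // first possible key for this A
        String stop = aId + (char) (SEP + 1);   // smallest key past the prefix
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    // Two-table layout, table 1: A uniqueid -> related B uniqueids.
    static Map<String, List<String>> aIndex() {
        Map<String, List<String>> m = new HashMap<>();
        m.put("a1", Arrays.asList("b1", "b7"));
        m.put("a2", Arrays.asList("b1", "b3"));
        return m;
    }

    // Two-table layout, table 2: B uniqueid -> properties.
    static Map<String, String> bTable() {
        Map<String, String> m = new HashMap<>();
        m.put("b1", "props of b1");
        m.put("b3", "props of b3");
        m.put("b7", "props of b7");
        return m;
    }

    // Two-step query: one lookup on A, then one independent lookup per B key.
    static List<String> multiGet(Map<String, List<String>> aToB,
                                 Map<String, String> bRows, String aId) {
        List<String> out = new ArrayList<>();
        for (String bId : aToB.getOrDefault(aId, Collections.emptyList())) {
            String row = bRows.get(bId);
            if (row != null) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scanByA(compositeTable(), "a1"));
        System.out.println(multiGet(aIndex(), bTable(), "a1"));
    }
}
```

Both calls in main return the same two rows for a1. In this model the difference is only in the access pattern: one range read versus several point reads. Whether that explains the latency gap observed on a real cluster is not something this sketch can show.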
