hbase-user mailing list archives

From N Keywal <nkey...@gmail.com>
Subject Re: HBase (BigTable) many to many with students and courses
Date Tue, 29 May 2012 16:19:25 GMT
Hi,

For the multiget, if it's small enough:
- it will be parallelized across all region servers concerned, i.e. you
will be as fast as the slowest region server.
- there will be one query per region server (i.e. gets are grouped by
region server).

If there are too many gets, the multiget will be split into smaller
subsets and the strategy above will be used for each subset, one subset
after another (blocking between them).

So large set --> small set will be OK from this point of view. Large
--> large won't.
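
A rough illustration of the behaviour described above, as a self-contained Python sketch rather than actual HBase client code. The region boundaries, server names, latencies, and the `max_batch` threshold are all made-up assumptions; the real client decides on its own when and how to split a large multiget.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical cluster layout: each region starts at a row key and is
# hosted by one region server. These values are invented for illustration.
REGION_STARTS = ["", "g", "n", "t"]            # sorted region start keys
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # server hosting each region


def server_for(row_key):
    """Return the (hypothetical) region server hosting row_key."""
    idx = bisect_right(REGION_STARTS, row_key) - 1
    return REGION_SERVERS[idx]


def plan_multiget(row_keys, max_batch=4):
    """Split the keys into subsets, then group each subset by region server.

    Returns a list of rounds; each round maps server -> keys. Rounds run one
    after another, while the per-server queries in a round run in parallel.
    max_batch is an illustrative threshold, not an HBase setting.
    """
    rounds = []
    for start in range(0, len(row_keys), max_batch):
        per_server = defaultdict(list)
        for key in row_keys[start:start + max_batch]:
            per_server[server_for(key)].append(key)
        rounds.append(dict(per_server))
    return rounds


def estimated_latency(rounds, server_latency_ms):
    """Each round costs as much as its slowest server; rounds add up."""
    return sum(max(server_latency_ms[s] for s in r) for r in rounds)


courses = ["algebra", "biology", "history", "physics", "zoology", "music"]
plan = plan_multiget(courses, max_batch=4)
latency = estimated_latency(plan, {"rs1": 5, "rs2": 20, "rs3": 8})
print(plan)
print(latency)
```

With these invented numbers, six gets need two sequential rounds, and each round is bounded by the slow `rs2`. That is the reason a large key set costs more than the per-server grouping alone would suggest.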

N.


On Tue, May 29, 2012 at 5:54 PM, Em <mailformailinglists@yahoo.de> wrote:
> Ian,
>
> thanks for your detailed response!
>
> Let me give you feedback to each point:
>> 1. You could denormalize the additional information (e.g. course
>> name) into the students table. Then, you're simply reading the
>> student row, and all the info you need is there. That places an extra
>> burden of write time and disk space, and does make you do a lot more
>> work when a course name changes.
> That's exactly what I thought about and that's why I avoid it. The
> students and courses example is an example you find at several points on
> the web, when describing the differences and translations of relations
> from an RDBMS into a Key-Value-store.
> In fact, everything you model with a Key-Value-storage like HBase,
> Cassandra etc. can be modeled as an RDBMS schema.
> Since a lot of people, like me, are coming from that edge, we must
> re-learn several basic things.
> It starts with understanding that you model a K-V-storage the way you
> want to access the data, not as the data relates to each other (in
> general terms) and ends with translating the connections of data into a
> K-V-schema as well as possible.
>
>
>> 2. You could do what you're talking about in your HBase access code:
>> find the list of course IDs you need for the student, and do a multi
>> get on the course table. Fundamentally, this won't be much more
>> efficient to do in batch mode, because the courses are likely to be
>> evenly spread out over the region servers (orthogonal to the
>> students). You're essentially doing a hash join, except that it's a
>> lot less pleasant than on a relational DB b/c you've got network
>> round trips for each GET. The disk blocks from the course table (I'm
>> assuming it's the smaller side) will likely be cached so at least
>> that part will be fast--you'll be answering those questions from
>> memory, not via disk IO.
>
> Whoa, what?
> I thought a Multiget would reduce network-roundtrips as it only accesses
> each region *one* time, fetching all the queried keys and values from
> there. If your data is randomly distributed, this could result in the
> same costs as with doing several Gets in a loop, but should work better
> if several Keys are part of the same region.
> Am I right, or did I misunderstand the concept?
>
>> 3. You could also let a higher client layer worry about this. For
>> example, your data layer query just returns a student with a list of
>> their course IDs, and then another process in your client code looks
>> up each course by ID to get the name. You can then put an external
>> caching layer (like memcached) in the middle and make things a lot
>> faster (though that does put the burden on you to have the code path
>> for changing course info also flush the relevant cache entries). In
>> your example, it's unlikely any institution would have more than a
>> few thousand courses, so they'd probably all stay in memory and be
>> served instantaneously.
> Hm, in what way does this give me an advantage over using HBase -
> assuming that the number of courses is small enough to fit in RAM - ?
> I know that Memcached is optimized for this purpose and might have much
> faster response times - no doubts.
> However, from a conceptual point of view: Why does Memcached handle the
> K-V-distribution more efficiently than HBase with warmed caches?
> Hopefully this question isn't that hard :).
>
>> This might seem laborious, and to a degree it is. But note that it's
>> difficult to see the utility of HBase with toy examples like this; if
>> you're really storing courses and students, don't use HBase (unless
>> you've got billions of students and courses, which seems unlikely).
>> The extra thought you have to put in to making schemas work for you
>> in HBase is only worth it when it gives you the ability to scale to
>> gigantic data sets where other solutions wouldn't.
> Well, the background is a private project. I know that it's a lot easier
> to do what I want in an RDBMS and there is no real need for using a
> highly scalable beast like HBase.
> However, I want to learn something new and since I do not break
> someone's business by trying out new technology privately, I want to go
> with HStack.
> Without ever doing it, you never get a real feeling of when to use the
> right tool.
> Using a good tool for the wrong problem can be an interesting
> experience, since you learn some of the do's and don'ts of the software
> you use.
>
> Since I am a reader of the MEAP-edition of HBase in Action, I am aware
> of the TwitBase-example application presented in that book.
> I am very interested in seeing the author presenting a solution for
> efficiently accessing the Tweets of the persons I follow.
> This is an n:m-relation.
> You have n users with m tweets, and each user sees his own tweets as
> well as the tweets of followed persons in descending order by timestamp.
> This must be done with a join within an RDBMS (and maybe in HBase also),
> since I cannot think of another scalable way of doing so.
>
> However, if you do this with a join, a person with 40,000
> followers needs a batch request consisting of 40,000 Get objects. That's
> huge, and I bet that it is anything but fast or scalable. It
> sounds broken by design when designing for Big Data.
> Therefore I am interested in general best practices for such problems.
>
> Maybe this is a better example for showing the possibilities of HBase
> than a students and courses example.
>
> Thanks for sharing your insights!
>
> Em
>
>
> On 29.05.2012 17:08, Ian Varley wrote:
>> Em,
>>
>> What you're describing is a classic relational database nested loop or hash join;
>> the only difference is that relational databases have this feature built in, and can do it
>> very efficiently because they typically run on a single machine, not a distributed cluster.
>> By moving to HBase, you're explicitly making a tradeoff that's worse for this kind of usage,
>> in exchange for having horizontally scalable data storage (i.e. you can scale to TB or PB
>> of data). But the reality is that this makes what you're describing a lot harder to do.
>>
>> A real answer to this question would involve talking a lot about JOIN theory in relational
>> databases: when do optimizers choose nested loop joins vs. hash joins or merge joins? How
>> do you know which side of a join to drive from (HBase doesn't keep stats, nor does it have
>> an optimizer for that matter). There's not really a general "what's the right way to do this",
>> divorced from those kinds of questions.
>>
>> That said, I can see at least a couple ways to make this particular operation (get
>> all courses for one student) efficient in HBase:
>>
>> 1. You could denormalize the additional information (e.g. course name) into the students
>> table. Then, you're simply reading the student row, and all the info you need is there. That
>> places an extra burden of write time and disk space, and does make you do a lot more work
>> when a course name changes.
>>
>> 2. You could do what you're talking about in your HBase access code: find the list
>> of course IDs you need for the student, and do a multi get on the course table. Fundamentally,
>> this won't be much more efficient to do in batch mode, because the courses are likely to be
>> evenly spread out over the region servers (orthogonal to the students). You're essentially
>> doing a hash join, except that it's a lot less pleasant than on a relational DB b/c you've
>> got network round trips for each GET. The disk blocks from the course table (I'm assuming
>> it's the smaller side) will likely be cached so at least that part will be fast--you'll be
>> answering those questions from memory, not via disk IO.
>>
>> 3. You could also let a higher client layer worry about this. For example, your data
>> layer query just returns a student with a list of their course IDs, and then another process
>> in your client code looks up each course by ID to get the name. You can then put an external
>> caching layer (like memcached) in the middle and make things a lot faster (though that does
>> put the burden on you to have the code path for changing course info also flush the relevant
>> cache entries). In your example, it's unlikely any institution would have more than a few
>> thousand courses, so they'd probably all stay in memory and be served instantaneously.
>>
>> This might seem laborious, and to a degree it is. But note that it's difficult to
>> see the utility of HBase with toy examples like this; if you're really storing courses and
>> students, don't use HBase (unless you've got billions of students and courses, which seems
>> unlikely). The extra thought you have to put in to making schemas work for you in HBase is
>> only worth it when it gives you the ability to scale to gigantic data sets where other solutions
>> wouldn't.
>>
>> Ian
>>
>> On May 29, 2012, at 9:28 AM, Em wrote:
>>
>>> Hi,
>>>
>>> thanks for your help.
>>> Yes, I know these slides.
>>> However I can not find an answer to how to access such schemas efficiently.
>>> In case of the given schema for students and courses as in those slides,
>>> they say that each column contains the student's id / course's id.
>>> However, when you want to build a GUI, you want to get all the courses
>>> for a given student and display their names.
>>> You *have* the column-names which represent the ids of the courses,
>>> however to get the human readable name of a course, you have to access
>>> the course-table.
>>>
>>> I understand the schema, agree with it, but my question was how to
>>> access this data efficiently within an application / how to implement
>>> the needed behaviour efficiently.
>>>
>>> Thanks! :)
>>> Em
>>>
>>> On 29.05.2012 12:49, shashwat shriparv wrote:
>>>> Check out this link; maybe it will help you somewhat:
>>>>
>>>> http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies
>>>>
>>>> On Tue, May 29, 2012 at 4:09 PM, Michel Segel <michael_segel@hotmail.com> wrote:
>>>>
>>>>> Depends...
>>>>> Try looking at a hierarchical model rather than a relational model...
>>>>>
>>>>> One thing to remember is that joins are expensive in HBase.
>>>>>
>>>>>
>>>>>
>>>>> Sent from a remote device. Please excuse any typos...
>>>>>
>>>>> Mike Segel
>>>>>
>>>>> On May 28, 2012, at 12:50 PM, Em <mailformailinglists@yahoo.de> wrote:
>>>>>
>>>>>> Hello list,
>>>>>>
>>>>>> I have some time now to try out HBase and want to use it for a private
>>>>>> project.
>>>>>>
>>>>>> Questions like "How do I transfer one-to-many or many-to-many relations
>>>>>> from my RDBMS's schema to HBase?" seem to be common.
>>>>>>
>>>>>> I hope we can collect all the best practices that are out there in
>>>>>> this thread.
>>>>>>
>>>>>> As the wiki states:
>>>>>> One should create two tables.
>>>>>> One for students, another for courses.
>>>>>>
>>>>>> Within the students' table, one should add one column per selected
>>>>>> course with the course_id besides some columns for the student itself
>>>>>> (name, birthday, sex etc.).
>>>>>>
>>>>>> On the other hand one fills the courses table with one column per
>>>>>> student_id besides some columns which describe the course itself
>>>>>> (name, teacher, begin, end, year, location etc.).
>>>>>>
>>>>>> So far, so good.
>>>>>>
>>>>>> How do I access these tables efficiently?
>>>>>>
>>>>>> A common case would be to show all courses per student.
>>>>>>
>>>>>> To do so, one has to access the student-table and get all the student's
>>>>>> courses-columns.
>>>>>> Let's say their names are prefixed ids. One has to remove the prefix
>>>>>> and then one accesses the courses-table to get all the courses and
>>>>>> their metadata (name, teacher, location etc.).
>>>>>>
>>>>>> How do I do this kind of operation efficiently?
>>>>>> The naive and brute force approach seems to be using a Get-object per
>>>>>> course and fetching the necessary data.
>>>>>> Another approach seems to be using the HTable-class and unleash the
>>>>>> power of "multigets" by using the batch()-method.
>>>>>>
>>>>>> All of the information above is theoretical, since I have not used it
>>>>>> in code (I am currently learning more about the fundamentals of HBase).
>>>>>>
>>>>>> That's why I give the question to you: How do you do this kind of
>>>>>> operation by using HBase?
>>>>>>
>>>>>> Kind regards,
>>>>>> Em
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>
>>
