hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: Join in HBase
Date Fri, 17 Jul 2009 15:57:50 GMT
I didn't see much of an algorithm beyond simple usage of MR.

You read in everything, output the column value as the key to the 
reduce.  They are then combined and you have the join in the reduce.

That's the general way you would do something simple like that in 
MapReduce.  But this is not MR, there is no shuffle/sort/reduce.
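
For reference, the general MapReduce pattern described above (emit the join
column value as the key, then combine the rows in the reduce) can be sketched
in plain Python. This is only an in-memory illustration, not Hadoop or HBase
code: the function names, the dict-per-row layout, and the `shuffle` helper
standing in for the framework's shuffle/sort are all assumptions made for the
example.

```python
from collections import defaultdict


def map_phase(table_name, rows, join_column):
    """Emit (join value, (table name, row)) for every row that has the join column."""
    for row in rows:
        if join_column in row:
            yield row[join_column], (table_name, row)


def shuffle(pairs):
    """Group mapper output by key; a stand-in for the framework's shuffle/sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values, left_name, right_name):
    """Pair every left-table row with every right-table row that shares the key."""
    left = [row for name, row in values if name == left_name]
    right = [row for name, row in values if name == right_name]
    for left_row in left:
        for right_row in right:
            yield {**left_row, **right_row}


def reduce_side_join(left_name, left_rows, right_name, right_rows, join_column):
    """Inner join of two row lists on join_column, done in the reduce step."""
    pairs = list(map_phase(left_name, left_rows, join_column))
    pairs += list(map_phase(right_name, right_rows, join_column))
    joined = []
    for _key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_phase(values, left_name, right_name))
    return joined


if __name__ == "__main__":
    # Hypothetical sample rows, loosely following the User/Audit schema
    # from the quoted thread below.
    users = [{"userid": "u1", "firstname": "Ada"}]
    audits = [{"auditid": "a1", "userid": "u1", "actionid": "login"}]
    for row in reduce_side_join("users", users, "audits", audits, "userid"):
        print(row)
```

Keys that appear in only one of the two inputs produce no output, so this
behaves like an inner join.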

Not sure how much more we can help without knowing the specifics of what 
you want to do.  Hoberto provided some nice examples and a breakdown of 
how you might solve them; that should help.
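
As a rough illustration of the application-side approach from my earlier
mail quoted below (1 query for the recent audits, then one lookup per audit
against each of the User and Actions tables, done by a pool of threads), here
is a minimal Python sketch. The in-memory dicts and the `scan_recent_audits`,
`get_user` and `get_action` helpers are hypothetical stand-ins for real HBase
round-trips, not part of any HBase client API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the three tables (audits newest first).
USERS = {
    "u1": {"firstname": "Ada", "lastname": "Lovelace"},
    "u2": {"firstname": "Alan", "lastname": "Turing"},
}
ACTIONS = {
    "login": {"actionname": "Login"},
    "delete": {"actionname": "Delete"},
}
AUDITS = [
    {"auditid": "a3", "userid": "u2", "actionid": "login"},
    {"auditid": "a2", "userid": "u1", "actionid": "delete"},
    {"auditid": "a1", "userid": "u1", "actionid": "login"},
]


def scan_recent_audits(limit):
    """Stand-in for the single query that fetches the most recent audits."""
    return AUDITS[:limit]


def get_user(userid):
    """Stand-in for one round-trip to the User table."""
    return USERS.get(userid)


def get_action(actionid):
    """Stand-in for one round-trip to the Actions table."""
    return ACTIONS.get(actionid)


def recent_audits_joined(limit=10, workers=8):
    """Join recent audits with users and actions using 1 + N + N lookups."""
    audits = scan_recent_audits(limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit both batches before consuming either, so the user and
        # action lookups overlap; map() returns results in input order.
        users = pool.map(get_user, [a["userid"] for a in audits])
        actions = pool.map(get_action, [a["actionid"] for a in audits])
        return [
            {**audit, "user": user, "action": action}
            for audit, user, action in zip(audits, users, actions)
        ]


if __name__ == "__main__":
    for row in recent_audits_joined(limit=2):
        print(row)
```

If the action names never change, denormalizing that field into the audit row
would remove the `get_action` batch entirely and cut the lookups roughly in
half.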


bharath vissapragada wrote:
> Hi,
> Did you see the algorithm for the map-reduce join I have implemented (I
> wrote it at the end of my previous mail)? Any comments about that? I mean
> any improvements and such.
> On Thu, Jul 16, 2009 at 11:12 PM, Jonathan Gray <jlist@streamy.com> wrote:
>> Hoberto, Bharath,
>> Designing these kinds of queries efficiently in HBase means doing multiple
>> round-trips, or denormalizing.
>> That means degrading performance as the query complexity increases, or lots
>> of data duplication and a complex write/update process.
>> In your audit example, you provide the denormalizing solution.  Store the
>> fields you need with the data you are querying (details of the user/action
>> in the audit table with the audit).  If you have to update those details,
>> then you have an extra expense on your write (and you introduce a potential
>> synchronization issue without transactions).
>> The choice about how to solve this really depends on the use case and what
>> your requirements are.  Can you ever afford to miss an update in one of the
>> denormalized fields, even if it is extremely unlikely?  You can build
>> transactional layers on top or you can take a look at TransactionalHBase
>> which attempts to do this in a more integrated way.
>> You also talk about the other approach, running multiple queries in the
>> application.  As far as memory pressure in the app is concerned, that would
>> really depend on the nature of the join.  It's more an issue of how many
>> joins you need to make, and if there's any way to reduce the number of
>> queries/trips needed.
>> If I am pulling the most recent 10 audits, and I need to join each with
>> both the User and Actions table, then we're talking about 1 + 10 + 10 total
>> queries.  That's not so pretty, but if done in a distributed or threaded way
>> may not be too bad.  In the future, I expect more and more tools/frameworks
>> available to aid in that process.
>> Today, this is up to you.
>> At Streamy, we solve these problems with layers above HBase.  Some of them
>> keep lots of stuff in-memory and do complex joins in-memory. Others
>> coordinate the multiple queries to HBase, with or without an OCC-style
>> transaction.
>> My suggestion is to start denormalized.  Build naive queries that do lots
>> of round-trips.  See what the performance is like under different conditions
>> and then go from there.  Perhaps Actions are generally immutable, their name
>> never changes, so you could denormalize that field and cut out half of the
>> total queries.  Have a pool of threads that grab Users so you can do the
>> join in parallel.  Depending on your requirements, this might be sufficient.
>>  Otherwise look at more denormalization, or building a thin layer above.
>> JG
>> Mr Hoberto wrote:
>>> I can think of two cases that I've been wondering about (I am very new,
>>> and am still reading the docs & archives, so I apologize if this has been
>>> already covered or if I use the wrong notation...I'm still learning).
>>> First case:
>>> Tracking audits. In the RDBMS world you'd have the following schema:
>>> User (userid, firstname, lastname)
>>> Actions (actionid, actionname)
>>> Audit (auditTime, userid, actionid)
>>> I think the answer in the HBase world is to denormalize the data...have a
>>> structure such as:
>>> audits (auditid, audittime[timestamp], whowhat[family (firstName,
>>> lastname, actionname)])
>>> The problem happens, as Bharath says, what if a firstName or LastName
>>> needs to be updated? Running a correction on all those denormalized rows
>>> is going to be problematic.
>>> Alternatively, I suppose you could store the User and Actions tables
>>> separately, and keep the audits structure in HBase storing only IDs, and
>>> use the website's application layer to "merge" the different data sets
>>> together for display on a page. The downside there is if you wind up with
>>> a significant amount of users or actions, it'll put a lot of memory
>>> pressure on the app servers.
>>> Second case:
>>> Doing analysis on two time-series based data structures, such as a "PE
>>> Ratio".
>>> In the RDBMS world you'd have two tables:
>>> Prices (ticker, date, price)
>>> Earnings (ticker, date, earning)
>>> Again, I think the answer is denormalizing in the HBase world, with a
>>> structure such as:
>>> PEs (date, timestamp, PERatio[family (ticker, PEvalue)])
>>> The problem here comes, again, with updates. For instance, what if you
>>> only have available earnings information on an annual basis, and you've
>>> come across a source that has it quarterly....You'll have to update 3/4 of
>>> the rows in the denormalized table.
>>> Once again, I apologize for any sort of misunderstanding..I'm still
>>> learning the concepts behind column stores and map/reduce.
>>> -hob
>>> On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <jlist@streamy.com>
>>> wrote:
>>>> The answer is, it depends.
>>>> What are the details of what you are trying to join?  Is it just a simple
>>>> 1-to-1 join, or 1-to-many or what?  At a minimum, the join would require
>>>> two round-trips.  However, 0.20 can do simple queries in the 1-10ms
>>>> time-range (closer to 1ms when the blocks are already cached).
>>>> The comparison to an RDBMS cannot be made directly because a single-node
>>>> RDBMS with a smallish table will be quite fast at simple index-based
>>>> joins.  I would guess that unloaded, single machine performance of this
>>>> join operation would be much faster in an RDBMS.
>>>> But if your table has millions or billions of rows, it's a different
>>>> situation.  HBase performance will stay nearly constant as your table
>>>> increases, as long as you have the nodes to support your dataset and the
>>>> load.
>>>> What are your targets for time (sub 100ms? 10ms?), and what are the
>>>> details of what you're joining?
>>>> As far as code is concerned, there is not much to a simple join, so I'm
>>>> not sure how helpful it would be.  If you give some detail perhaps I can
>>>> provide some pseudo-code for you.
>>>> JG
>>>> bharath vissapragada wrote:
>>>>> JG, thanks for your reply.
>>>>> Actually I am trying to implement a realtime join of two tables on
>>>>> HBase. I tried the idea of denormalizing the tables to avoid the joins,
>>>>> but when we do that, updating the data is really difficult. I
>>>>> understand that the features I am trying to implement are those of an
>>>>> RDBMS and HBase is used for a different purpose. Even then I want
>>>>> (rather, I would like to try) to store the data in HBase and implement
>>>>> joins so that I could test its performance, and if it's effective (at
>>>>> least on a large number of nodes), it may be of some help to me. I know
>>>>> some people have already tried this. If anyone has already tried this,
>>>>> can you tell me how the results are? I mean, are they good when
>>>>> compared to an RDBMS join on a single machine?
>>>>> Thanks
>>>>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <jlist@streamy.com>
>>>>> wrote:
>>>>>> Bharath,
>>>>>> You need to outline what your actual requirements are if you want
>>>>>> help.  Open-ended questions that just ask for code are usually not
>>>>>> answered.
>>>>>> What exactly are you trying to join?  Does this join need to happen
>>>>>> "realtime" or is this part of a batch process?
>>>>>> Could you denormalize your data to prevent needing the join at runtime?
>>>>>> If you provide details about exactly what your data/schema is like (or
>>>>>> a similar example if this is confidential), then many of us are more
>>>>>> than happy to help you figure out what approach may work best.
>>>>>> When working with HBase, figuring out how you want to pull your data
>>>>>> out is key to how you want to put the data in.
>>>>>> JG
>>>>>> bharath vissapragada wrote:
>>>>>>> Amandeep, can you tell me what kinds of joins you have implemented
>>>>>>> and which works the best (based on observation)? Can you show us the
>>>>>>> source code (if possible)?
>>>>>>> Thanks in advance
>>>>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <amansk@gmail.com>
>>>>>>> wrote:
>>>>>>>> I've been doing joins by writing my own MR jobs. That works.
>>>>>>>> Not tried cascading yet.
>>>>>>>> -ak
>>>>>>>> On 7/14/09, bharath vissapragada <bharathvissapragada1990@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> That's fine. I know that HBase is completely different compared to
>>>>>>>>> SQL. But for my application there is some kind of dependency
>>>>>>>>> involved among the tables, so I need to implement a join. I wanted
>>>>>>>>> to know whether there is some kind of implementation already.
>>>>>>>>> Thanks
>>>>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <ryanobjc@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> HBase != SQL.
>>>>>>>>>> You might want map reduce or cascading.
>>>>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>>>>> vissapragada<bharat_v@students.iiit.ac.in>
>>>>>>>>>>> Hi all,
>>>>>>>>>>> I want to join (similar to a relational database join) two tables
>>>>>>>>>>> in HBase. Can anyone tell me whether it is already implemented in
>>>>>>>>>>> the source?
>>>>>>>>>>> Thanks in Advance
>>>>>>>>>>>  --
>>>>>>>> Amandeep Khurana
>>>>>>>> Computer Science Graduate Student
>>>>>>>> University of California, Santa Cruz
