hbase-user mailing list archives

From Kay Kay <kaykay.uni...@gmail.com>
Subject Re: Hbase as Map/Reduce source
Date Fri, 29 Jan 2010 06:15:53 GMT
If you are talking about updates and deletes, then I would imagine you
already have some primary key to refer to each row by.

As for handling deletes: from an HBase schema design perspective, you
may need a secondary, insert-only table that stores delete transactions
exclusively. The MR pipeline can then periodically scan that table to
take note of deletions.
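
A minimal sketch of the delete-log idea, in plain Python with in-memory
structures standing in for the HBase tables. The names DeleteRecord,
deletes_since and apply_deletes are hypothetical, chosen only for
illustration; this is not HBase API code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeleteRecord:
    ts: int        # when the delete was logged (assumed monotonic)
    row_key: str   # primary key of the row that was deleted


def deletes_since(delete_log, last_seen_ts):
    """Scan the insert-only log for deletes newer than the last checkpoint.

    Returns the set of deleted row keys and the new checkpoint timestamp.
    """
    fresh = [r for r in delete_log if r.ts > last_seen_ts]
    keys = {r.row_key for r in fresh}
    checkpoint = max((r.ts for r in fresh), default=last_seen_ts)
    return keys, checkpoint


def apply_deletes(snapshot, deleted_keys):
    """Drop deleted rows from a snapshot the pipeline already holds."""
    return {k: v for k, v in snapshot.items() if k not in deleted_keys}
```

Each periodic run would pass in the checkpoint returned by the previous
run, so only new delete records are looked at.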

To handle updates on a single-family table (as a trivial case): storing
the updated snapshot in a table is relatively straightforward. To
capture the update transactions themselves, though, you may need a
secondary table (like a meta index), since scanning through the whole
table to look for updates, even as an M-R job, would be expensive.
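
A rough sketch of such a meta index, again in plain Python with a dict
and a list standing in for the two tables. The names put_with_index and
changed_rows are hypothetical, and a real setup would have to keep the
two writes consistent, which this sketch ignores.

```python
def put_with_index(table, change_index, row_key, value, ts):
    """Write the latest value and record the change in the secondary index."""
    table[row_key] = value                # main table keeps only the snapshot
    change_index.append((ts, row_key))    # index is append-only


def changed_rows(table, change_index, since_ts):
    """Read only rows touched after since_ts, without a full table scan."""
    keys = {row_key for ts, row_key in change_index if ts > since_ts}
    return {k: table[k] for k in keys if k in table}
```

The job then reads the small index first and fetches only the rows it
names, instead of scanning the whole table for changes.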

The actual decision depends on how frequent delete/update transactions
are for the schema under consideration, and on the *width* of the
column family changes, in terms of what it takes to store the
transaction representations.





On 01/28/2010 09:06 PM, y_823910@tsmc.com wrote:
> What if I want to analyse data that contains update and delete
> records?
> In that scenario, HBase is a better M/R source than raw HDFS files,
> is that correct?
>
> Fleming Chiu(邱宏明)
> 707-6128
> y_823910@tsmc.com
> Meat-free Monday: eat vegetarian, save the Earth (Meat Free Monday Taiwan)
>
> Kay Kay <kaykay.unique@gmail.com>
> 2010/01/29 11:05 AM
> Please respond to hbase-user
>
> To:      hbase-user@hadoop.apache.org
> cc:      (bcc: Y_823910/TSMC)
> Subject: Re: Hbase as Map/Reduce source
>
> HDFS is a double-edged sword. Because it is a raw file system, you can
> feed its files to a MapReduce program, although you may need to define
> appropriate InputSplits to chop the input into manageable pieces.
>
> OTOH, HBase holds structured data (well, sort of!), using a file format
> on top of HDFS to store the schema, and hence comes with predefined
> InputSplits that make it easy to get started on a MapReduce program.
> From an API simplicity point of view, HBase can get you started
> relatively faster because of this (assuming your data is in HBase).
>
> Refer to
> http://wiki.apache.org/hadoop/Hbase/MapReduce
>
> Although the wiki says deprecated, in reality it is suggested to
> stick with the *.mapred.* packages for some time, since the
> .mapreduce.* packages are not mature enough at this point.
>
> The decision has entirely to do with the kind of data you have and
> whether you can identify it by a primary key that suits your
> application, which is all HBase in its rudimentary form needs.
>
> On the other hand, if having a schema and defining a primary key for
> your data does not fit your app, you can stick with HDFS and a custom
> InputSplit suited to your data. HBase provides a lot more than HDFS in
> terms of scanning and row-id ordering, and if you do not need these
> features, storing the data in HDFS should be just about OK.
>
>
>
>
> On 1/28/10 6:20 PM, Otis Gospodnetic wrote:
>> I asked a similar question recently:
>>
>> http://search-hadoop.com/m?id=843956.53875.qm@web50305.mail.re2.yahoo.com||hbase%20mapreduce%20otis%20TableInputFormat
>>
>> Otis
>>
>> ----- Original Message ----
>>
>>> From: "y_823910@tsmc.com"<y_823910@tsmc.com>
>>> To: hbase-user@hadoop.apache.org
>>> Sent: Thu, January 28, 2010 8:02:39 PM
>>> Subject: Hbase as Map/Reduce source
>>>
>>> Hi,
>>>
>>> I want to understand clearly how HBase works as a Map/Reduce source.
>>> Basically, if a table has 100 regions, 100 map tasks will be started,
>>> right?
>>> What's the difference between HDFS and HBase as a Map/Reduce source?
>>> Thanks
>>>
>>>
>>>
>>>
>>> Fleming Chiu(邱宏明)
>>> 707-6128
>>> y_823910@tsmc.com
>>> Meat-free Monday: eat vegetarian, save the Earth (Meat Free Monday Taiwan)
>>>
>>> ---------------------------------------------------------------------------
>>>                                                           TSMC PROPERTY
>>> This email communication (and any attachments) is proprietary information
>>> for the sole use of its intended recipient. Any unauthorized review, use
>>> or distribution by anyone other than the intended recipient is strictly
>>> prohibited.  If you are not the intended recipient, please notify the
>>> sender by replying to this email, and then delete this email and any
>>> copies of it immediately. Thank you.
>>> ---------------------------------------------------------------------------

