hbase-user mailing list archives

From Ioan Eugen Stan <stan.ieu...@gmail.com>
Subject Re: advice needed on storing large objects on hdfs
Date Mon, 30 Jan 2012 08:11:24 GMT
On 30.01.2012 09:53, Rohit Kelkar wrote:
> Hi Stack,
> My problem is that I have large number of smaller objects and a few
> larger objects. My strategy is to store smaller objects (size < 5MB)
> in hbase and larger objects (size > 5MB) on hdfs. And I also want to
> run MapReduce tasks on those objects. Loan suggested that I should put
> all objects in a MapFile/SequenceFile on hdfs and insert in to hbase
> the reference of the object stored in the file. Now if I run a
> mapreduce task, my mapper would be run locally wrt the object
> references and not the actual dfs block where the object resides.
> - Rohit Kelkar

Hi Rohit,

First, my name is Ioan (with an 'i'). Second, it's a tricky question. If 
you run MapReduce with input from HBase you will get data locality for 
the HBase data, not for the data in your SequenceFiles. You could get 
data locality for those if you run a pre-setup job that scans HBase and 
builds the list of files to process, and then run a second MR job on 
Hadoop targeting the SequenceFiles. I think you can find ways to make 
that pre-processing step fast.
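
Just to make it concrete, here is a rough sketch of what that pre-setup 
job could look like: a map-only scan that emits the stored HDFS 
reference for every row. Its output becomes the input list for the 
second job, which reads the SequenceFiles with SequenceFileInputFormat 
and gets its locality from the HDFS block locations. The table name and 
the 'meta:ref' column are just placeholders, adapt them to your schema:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/** Pre-setup job: scan the HBase table and collect the HDFS references
 *  of the large objects. */
public class CollectReferencesJob {

  /** Emits the stored reference (an HDFS path) for every row;
   *  the 'meta:ref' column is a placeholder. */
  static class RefMapper extends TableMapper<Text, NullWritable> {
    private static final byte[] FAMILY = Bytes.toBytes("meta");
    private static final byte[] QUALIFIER = Bytes.toBytes("ref");

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] ref = value.getValue(FAMILY, QUALIFIER);
      if (ref != null) {
        context.write(new Text(ref), NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(HBaseConfiguration.create(), "collect-object-references");
    job.setJarByClass(CollectReferencesJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner batches, fewer RPCs
    scan.setCacheBlocks(false);  // don't fill the block cache from a full scan

    TableMapReduceUtil.initTableMapperJob("objects", scan, RefMapper.class,
        Text.class, NullWritable.class, job);
    job.setNumReduceTasks(0);    // map-only: output is just the list of files to process
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}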

The set-up I described is better suited to situations where you need to 
stream data larger than HBase is recommended to handle, such as 
mailboxes with large attachments. I'm planning to implement it soon in 
Apache James's HBase mailbox implementation to deal with large inboxes.
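
In case it helps, the write side of that set-up could look roughly like 
this. It is only a sketch: the 5 MB cut-off comes from your mail, and 
the table, column family and path names are invented:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Store small objects directly in HBase; write large ones to a
 *  SequenceFile on HDFS and keep only the reference in HBase. */
public class LargeObjectStore {

  private static final long THRESHOLD = 5 * 1024 * 1024; // 5 MB cut-off

  public static void store(String objectId, byte[] data) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "objects");   // table/column names are made up
    Put put = new Put(Bytes.toBytes(objectId));

    if (data.length < THRESHOLD) {
      // Small object: keep the bytes directly in HBase.
      put.add(Bytes.toBytes("data"), Bytes.toBytes("blob"), data);
    } else {
      // Large object: append it to a SequenceFile keyed by objectId...
      FileSystem fs = FileSystem.get(conf);
      Path file = new Path("/objects/large-" + objectId + ".seq");
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, file, Text.class, BytesWritable.class);
      try {
        writer.append(new Text(objectId), new BytesWritable(data));
      } finally {
        writer.close();
      }
      // ...and store only the HDFS path (the reference) in HBase.
      put.add(Bytes.toBytes("meta"), Bytes.toBytes("ref"),
          Bytes.toBytes(file.toString()));
    }
    table.put(put);
    table.close();
  }
}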


Ioan Eugen Stan
