hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: MapReduce job with mixed data sources: HBase table and HDFS files
Date Fri, 05 Jul 2013 21:06:18 GMT
Actually you can, albeit slower than you would think.

You'd have to do a single-threaded scan to pull the data from the remote cluster to the local
cluster; once it's local, you can parallelize the HDFS m/r portion of the job.
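
For what it's worth, the pull itself is just a plain client-side scan. A rough sketch
(the quorum host, table name, and staging path are made-up placeholders):

    // Single-threaded scan of the remote table, dumped into local HDFS.
    Configuration hbaseConf = HBaseConfiguration.create();
    hbaseConf.set("hbase.zookeeper.quorum", "remote-zk-host");   // remote cluster

    HTable table = new HTable(hbaseConf, "the_table");
    ResultScanner scanner = table.getScanner(new Scan());

    FileSystem fs = FileSystem.get(new Configuration());        // local cluster
    FSDataOutputStream out = fs.create(new Path("/staging/the_table"));
    try {
        for (Result r : scanner) {
            // Write out whatever you need; row keys only here, for illustration.
            out.writeBytes(Bytes.toString(r.getRow()) + "\n");
        }
    } finally {
        out.close();
        scanner.close();
        table.close();
    }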

Note: just because you can do something doesn't mean it's going to be a good idea.

An alternative would be for the client to run an m/r job on the remote cluster which then writes
to the second cluster. This will parallelize the initial scan.
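
Roughly, that's just a normal map-only scan job submitted against the remote cluster, with its
output path pointing at the second cluster's namenode. A sketch along those lines (the addresses,
names, and mapper class are placeholders):

    // Map-only export job: scan the table on the remote HBase cluster and
    // write the mapper output straight into the second cluster's HDFS.
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "export-to-second-cluster");
    job.setJarByClass(ExportDriver.class);          // placeholder driver class

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);                     // don't pollute the block cache

    TableMapReduceUtil.initTableMapperJob("the_table", scan,
        ExportMapper.class, Text.class, Text.class, job);

    job.setNumReduceTasks(0);                       // map-only
    // Fully qualified output path so the data lands on the *other* cluster.
    FileOutputFormat.setOutputPath(job,
        new Path("hdfs://second-cluster-nn:8020/incoming/the_table"));
    job.waitForCompletion(true);

    // ExportMapper would be a TableMapper<Text, Text> that writes out the
    // rows/columns you care about.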

On Jul 3, 2013, at 8:02 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:

> Hi,
> 1) You cannot feed two different clusters' data into one MR job.
> 2) If your data is located in the same cluster, then:
> 
>    conf.set(TableInputFormat.SCAN,
> TableMapReduceUtil.convertScanToString(new Scan()));
>    conf.set(TableInputFormat.INPUT_TABLE, tableName);
> 
>    MultipleInputs.addInputPath(conf, new Path(input_on_hdfs),
> TextInputFormat.class, MapperForHdfs.class);
>    MultipleInputs.addInputPath(conf, new Path(input_on_hbase),
> TableInputFormat.class, MapperForHBase.class);
> 
> But note that new Path(input_on_hbase) can be any path; it is not actually
> used by TableInputFormat, so its value makes no sense on its own.
> 
> Please refer to org.apache.hadoop.hbase.mapreduce.IndexBuilder (under
> $HBASE_HOME/src/example) for how to read a table in an MR job.
> 
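> A fuller sketch of that wiring with the mapreduce API might look like the
> following (the driver, mapper, and reducer class names, paths, and table
> name are all placeholders):
> 
>     // One job, two input sources. The key point: both mappers must emit
>     // the same intermediate key/value types.
>     Configuration conf = HBaseConfiguration.create();
>     conf.set(TableInputFormat.INPUT_TABLE, "the_table");
>     conf.set(TableInputFormat.SCAN,
>         TableMapReduceUtil.convertScanToString(new Scan()));
> 
>     Job job = new Job(conf, "hbase-plus-hdfs");
>     job.setJarByClass(MixedDriver.class);
> 
>     MultipleInputs.addInputPath(job, new Path("/data/on/hdfs"),
>         TextInputFormat.class, MapperForHdfs.class);
>     // Dummy path: TableInputFormat takes the table name from the conf
>     // above, not from this path.
>     MultipleInputs.addInputPath(job, new Path("/ignored"),
>         TableInputFormat.class, MapperForHBase.class);
> 
>     job.setReducerClass(MyReducer.class);
>     job.setOutputKeyClass(Text.class);
>     job.setOutputValueClass(Text.class);
>     FileOutputFormat.setOutputPath(job, new Path("/output"));
>     job.waitForCompletion(true);
> 
> Here MapperForHdfs extends Mapper<LongWritable, Text, Text, Text> and
> MapperForHBase extends TableMapper<Text, Text>, so both feed Text/Text
> pairs to the same reducer.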
> 
> On Thu, Jul 4, 2013 at 5:19 AM, Michael Segel <michael_segel@hotmail.com> wrote:
> 
>> You may want to pull your data from HBase first in a separate map-only
>> job and then use its output along with the other HDFS input.
>> There is a significant disparity between read performance from HDFS and
>> from HBase.
>> 
>> 
>> On Jul 3, 2013, at 10:34 AM, S. Zhou <myxjtu@yahoo.com> wrote:
>> 
>>> Azuryy, I am looking at the MultipleInputs doc, but I could not figure
>>> out how to add an HBase table as a Path to the input. Do you have some
>>> sample code? Thanks!
>>> 
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Azuryy Yu <azuryyyu@gmail.com>
>>> To: user@hbase.apache.org; S. Zhou <myxjtu@yahoo.com>
>>> Sent: Tuesday, July 2, 2013 10:06 PM
>>> Subject: Re: MapReduce job with mixed data sources: HBase table and HDFS
>> files
>>> 
>>> 
>>> Hi,
>>> 
>>> Use MultipleInputs, which can solve your problem.
>>> 
>>> 
>>> On Wed, Jul 3, 2013 at 12:34 PM, S. Zhou <myxjtu@yahoo.com> wrote:
>>> 
>>>> Hi there,
>>>> 
>>>> I know how to create a MapReduce job with HBase as the only data source
>>>> or with HDFS files as the only data source. Now I need to create a
>>>> MapReduce job with mixed data sources, that is, this MR job needs to
>>>> read data from both HBase and HDFS files. Is it possible? If yes, could
>>>> you share some sample code?
>>>> 
>>>> Thanks!
>>>> Senqiang
>> 
>> 

