hbase-user mailing list archives

From Lars George <l...@worldlingo.com>
Subject Re: help with map-reduce
Date Tue, 07 Apr 2009 13:48:35 GMT
Hi Rakhi,

The way the filters work is that you either use one of the supplied filters 
or create your own subclass. In the latter case you have to deploy that class 
to all RegionServers and add it to the classpath in each server's hbase-env.sh 
(in the "export HBASE_CLASSPATH" variable). We are currently discussing 
whether this could be done dynamically 
(https://issues.apache.org/jira/browse/HBASE-1288).

Once you have that done, or if you use one of the supplied ones, you can 
assign the filter by subclassing TableInputFormat and overriding its 
configure() method, like so:

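  public void configure(JobConf job) {
      // keep the default setup (table and input columns) done by TableInputFormat
      super.configure(job);
      // only rows whose key matches the expression are handed to the map
      RegExpRowFilter filter = new RegExpRowFilter("ABC.*");
      setRowFilter(filter);
  }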
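
To get a feel for which row keys an expression like "ABC.*" would let through, here is a minimal standalone sketch. It uses only java.util.regex and no HBase classes. The class name RowKeyRegexSketch and the sample keys are made up for illustration, and it assumes the filter requires the whole row key to match the expression (not just a substring), which is worth checking against the filter version you deploy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Standalone illustration only: plain java.util.regex, no HBase classes.
final class RowKeyRegexSketch {

    // Returns the row keys that the expression matches in full.
    // Assumption: the whole key must match, not just a substring of it.
    static List<String> selectRows(String regex, List<String> rowKeys) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String key : rowKeys) {
            if (pattern.matcher(key).matches()) {
                selected.add(key);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical row keys, purely for demonstration.
        List<String> keys = Arrays.asList("ABC123", "XABC", "ABC", "abc9");
        System.out.println(selectRows("ABC.*", keys)); // prints [ABC123, ABC]
    }
}
```

Under that assumption "XABC" is rejected because the key has to start with "ABC", and "abc9" is rejected because the match is case sensitive.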

As Tim points out, setting the whole thing up is done in your main M/R 
tool based application, similar to:

  JobConf job = new JobConf(...);
  TableMapReduceUtil.initTableMapJob("<table-name>", "<columns>",
    IdentityTableMap.class, ImmutableBytesWritable.class, RowResult.class, job);
  job.setReducerClass(MyTableReduce.class);
  job.setInputFormat(MyTableInputFormat.class);
  job.setOutputFormat(MyTableOutputFormat.class);

Of course this depends on which classes you want to replace and on whether 
this is a Reduce-oriented job (meaning a default identity map plus filter, 
with all the work done in the Reduce phase) or the other way around. But the 
principles and the filtering are the same.

HTH,
Lars


Rakhi Khatwani wrote:
> Thanks Ryan, i will try that
>
> On Tue, Apr 7, 2009 at 3:05 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>
>
>> there is a server-side mechanism to filter rows, it's found in the
>> org.apache.hadoop.hbase.filter package.  im not sure how this interops with
>> the TableInputFormat exactly.
>>
>> setting a filter to reduce the # of rows returned is pretty much exactly
>> what you want.
>>
>> On Tue, Apr 7, 2009 at 2:26 AM, Rakhi Khatwani <rakhi.khatwani@gmail.com>
>> wrote:
>>
>>> Hi,
>>>     i have a map reduce program with which i read from a hbase table.
>>> In my map program i check if the column value of a is xxx, if yes then
>>> continue with processing else skip it.
>>> however if my table is really big, most of my time in the map gets wasted
>>> for processing unwanted rows.
>>> is there any way through which we could send a subset of rows (based on the
>>> value of a particular column family) to the map???
>>>
>>> i have also gone through TableInputFormatBase but am not able to figure out
>>> how do we set the input format if we are using TableMapReduceUtil class to
>>> initialize table map jobs. or is there any other way i could use it.
>>>
>>> Thanks in Advance,
>>> Raakhi.
