hbase-user mailing list archives

From Lars George <l...@worldlingo.com>
Subject Re: help with map-reduce
Date Tue, 07 Apr 2009 13:48:35 GMT
Hi Rakhi,

The way the filters work is that you either use the supplied filters or
create your own subclasses. If you write your own, you have to deploy
that class to all RegionServers and add it to their respective
hbase-env.sh (in the "export HBASE_CLASSPATH" variable). We are currently
discussing whether this could be done dynamically.

Once you have done that, or if you use one of the supplied filters, you
can assign the filter by overriding the TableInputFormat's configure()
method, like so:

  public void configure(JobConf job) {
      super.configure(job);  // keep the default table/column setup
      setRowFilter(new RegExpRowFilter("ABC.*"));  // inherited from TableInputFormatBase
  }
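
To get a feel for which rows a pattern such as "ABC.*" would let through,
the matching can be tried out in plain Java before anything is deployed.
The sketch below does not use HBase at all. It assumes the filter applies
the regular expression to the row key as a whole-string match; the
RowKeyPatternDemo class and the sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RowKeyPatternDemo {

    // Returns the row keys that match the pattern as a whole string.
    // Assumption: this mirrors how a regex row filter selects keys.
    static List<String> keep(String regex, List<String> rowKeys) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String key : rowKeys) {
            if (pattern.matcher(key).matches()) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical row keys, only for trying out the pattern.
        List<String> keys = Arrays.asList("ABC", "ABC-001", "XABC", "abc-002");
        System.out.println(keep("ABC.*", keys));
    }
}
```

With these sample keys only "ABC" and "ABC-001" are kept: under a
whole-string match the pattern is anchored at the start of the key and is
case sensitive.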

As Tim points out, setting the whole thing up is done in your main M/R 
tool based application, similar to:

  JobConf job = new JobConf(...);
  TableMapReduceUtil.initTableMapJob("<table-name>", "<columns>", 
    ImmutableBytesWritable.class, RowResult.class, job);

Of course, the details depend on which classes you want to replace and
on whether this is a Reduce-oriented job (meaning a default identity +
filter map, with all the work done in the Reduce phase) or the other way
around. But the principles and filtering are the same.
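
As a rough picture of the Reduce-oriented layout described above, here is
a small in-memory model in plain Java. It is only a sketch and uses no
Hadoop or HBase classes: the Row type, the sample table and the
per-value count are hypothetical, and the key filter again assumes a
whole-string regex match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class ReduceOrientedSketch {

    // Hypothetical row: a key plus the value of one column.
    static final class Row {
        final String key;
        final String value;
        Row(String key, String value) { this.key = key; this.value = value; }
    }

    // Filter step: only rows whose key matches the regex reach the map.
    // Map step: identity, each surviving row is passed on unchanged.
    // Reduce step: the actual work, here counting rows per column value.
    static Map<String, Integer> run(String regex, List<Row> table) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (Row row : table) {
            if (!pattern.matcher(row.key).matches()) {
                continue; // filtered out before the map ever sees it
            }
            Row mapped = row; // identity map
            counts.merge(mapped.value, 1, Integer::sum); // reduce
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Row> table = Arrays.asList(
            new Row("ABC-1", "xxx"),
            new Row("ABC-2", "yyy"),
            new Row("DEF-1", "xxx"),
            new Row("ABC-3", "xxx"));
        System.out.println(run("ABC.*", table)); // prints {xxx=2, yyy=1}
    }
}
```

In the real job the filtering input format would presumably be registered
on the JobConf with Hadoop's job.setInputFormat(...) after the
initTableMapJob call, so that it replaces the default one; the name of
your subclass is up to you.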


Rakhi Khatwani wrote:
> Thanks Ryan, i will try that
> On Tue, Apr 7, 2009 at 3:05 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>> there is a server-side mechanism to filter rows, it's found in the
>> org.apache.hadoop.hbase.filter package.  im not sure how this interops with
>> the TableInputFormat exactly.
>> setting a filter to reduce the # of rows returned is pretty much exactly
>> what you want.
>> On Tue, Apr 7, 2009 at 2:26 AM, Rakhi Khatwani <rakhi.khatwani@gmail.com> wrote:
>>> Hi,
>>>     i have a map reduce program with which i read from a hbase table.
>>> In my map program i check if the column value of a is xxx, if yes then
>>> continue with processing else skip it.
>>> however if my table is really big, most of my time in the map gets wasted
>>> for processing unwanted rows.
>>> is there any way through which we could send a subset of rows (based on the
>>> value of a particular column family) to the map???
>>> i have also gone through TableInputFormatBase but am not able to figure out
>>> how do we set the input format if we are using TableMapReduceUtil class to
>>> initialize table map jobs. or is there any other way i could use it.
>>> Thanks in Advance,
>>> Raakhi.
