hbase-user mailing list archives

From Lars George <l...@worldlingo.com>
Subject Re: help with map-reduce
Date Thu, 09 Apr 2009 15:38:28 GMT
Hi Rakhi,

Looking through the code of the ColumnValueFilter again, it seems it 
does what you want when you add the extra "filterIfColumnMissing" 
parameter to the constructor and set it to "false". The default "true" 
does the column filtering and will return all rows that have that 
column. Setting it to "false" activates the "filterRow()" check (although 
I am not sure yet where that is called - the others I can see in use in 
the StoreScanner class) to filter out rows that do not have a column 
match - which is what you want. Of course you still need to invert the 
check as mentioned in the previous email.
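
To make the inversion concrete, here is a minimal plain-Java sketch of the
convention. It does not use the HBase classes: RowFilterSketch, its CompareOp
enum, the sample rows and the choice to drop rows that lack the column are all
made up for illustration. A filter method that returns true drops the row, so
keeping the "UNCOLLECTED" rows means filtering on NOT_EQUAL:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the "true means filter out" convention; NOT the HBase API.
class RowFilterSketch {
    enum CompareOp { EQUAL, NOT_EQUAL }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Returns true when the row should be filtered OUT (dropped).
    static boolean filterRow(Map<String, byte[]> row, String column,
                             CompareOp op, byte[] value) {
        byte[] cell = row.get(column);
        if (cell == null) {
            return true; // sketch assumption: rows lacking the column are dropped
        }
        boolean matches = Arrays.equals(cell, value);
        return op == CompareOp.EQUAL ? matches : !matches;
    }

    // Hypothetical three-row table keyed by URL, each with a "Status:" cell.
    static Map<String, Map<String, byte[]>> sampleTable() {
        Map<String, Map<String, byte[]>> table = new LinkedHashMap<>();
        String[][] rows = {
            {"url1", "UNCOLLECTED"}, {"url2", "COLLECTED"}, {"url3", "UNCOLLECTED"}
        };
        for (String[] r : rows) {
            Map<String, byte[]> cells = new LinkedHashMap<>();
            cells.put("Status:", utf8(r[1]));
            table.put(r[0], cells);
        }
        return table;
    }

    // Row keys that survive a scan filtered on Status: <op> "UNCOLLECTED".
    static List<String> keptRows(CompareOp op) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Map<String, byte[]>> e : sampleTable().entrySet()) {
            if (!filterRow(e.getValue(), "Status:", op, utf8("UNCOLLECTED"))) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("EQUAL     keeps " + keptRows(CompareOp.EQUAL));
        System.out.println("NOT_EQUAL keeps " + keptRows(CompareOp.NOT_EQUAL));
    }
}
```

With EQUAL the matching rows are the ones removed, which is the opposite of
what the map job needs.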

Lars

Rakhi Khatwani wrote:
> Hi Lars,
>                  Hmm... i had a look at other filters.. but i thought
> ColumnValueFilter would be more appropriate coz in the constructor we could
> mention the column name and the value.
> Probably i am going wrong there.
>
> what i want is to filter out all the rows based on some column value. what
> do you suggest??.
>
> thanks a ton
> Rakhi
>
> On Thu, Apr 9, 2009 at 11:46 AM, Lars George <lars@worldlingo.com> wrote:
>
>   
>> Hi Rakhi,
>>
>> Sorry, not yet. This is not an easy thing to replicate. I will try though
>> over the next few days if I find time. A few things to note though first.
>> The way filters work is that they do *not* let filtered rows through but
>> actually filter them out. That means your logic seems reversed:
>>
>>  final RowFilterInterface colFilter = new ColumnValueFilter(
>>      "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
>>      "UNCOLLECTED".getBytes());
>>  setRowFilter(colFilter);
>>
>>
>> I think you *want* the uncollected columns to be processed? At least that
>> is what you said below :) So you will have to filter all other rows out of
>> the set that are NOT EQUAL to "UNCOLLECTED".
>>
>> Second, be careful with "UNCOLLECTED".getBytes() as that uses your system's
>> default encoding. Better use Bytes.toBytes("UNCOLLECTED") - but that should
>> of course match the way you store those strings in the first place. The
>> filters do a byte level compare so that is very sensitive.
>>
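
A minimal standard-library sketch of the encoding point above, assuming the
values were stored as UTF-8 (which, as far as I know, is what Bytes.toBytes
produces for strings). EncodingSketch and its helper methods are hypothetical
names, not part of HBase:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingSketch {
    // Encode with an explicit charset instead of the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Byte-level comparison, the kind of check a value filter performs.
    static boolean sameBytes(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        byte[] stored = utf8("UNCOLLECTED");

        // Same text, same charset: the bytes match.
        System.out.println(sameBytes(stored, utf8("UNCOLLECTED")));

        // Different casing: the bytes differ, so a byte-level filter will not match.
        System.out.println(sameBytes(stored, utf8("Uncollected")));

        // A non-ASCII character encodes to different lengths per charset,
        // which is why relying on the platform default is fragile.
        String accented = "\u00e9";
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(accented.getBytes(StandardCharsets.ISO_8859_1).length);
    }
}
```

For pure ASCII values such as "UNCOLLECTED" most default encodings give the
same bytes, so this matters mainly once other characters show up.
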
>> This does not address yet why you see both values or have matches at all.
>> It rather sounds like the filter is not active?
>>
>> And lastly, using the ColumnValueFilter will always let through all rows! It
>> is designed to strip out the columns of each row, but not filter on the row
>> itself. Is that what you want? If not you may have to use a different filter
>> class.
>>
>>
>> Lars
>>
>>
>> Rakhi Khatwani wrote:
>>
>>     
>>> Hi Lars,
>>>              Just wanted to follow up, did you try out the column value
>>> filter? did it work??
>>> i really need it to improve the performance of my map-reduce programs.
>>>
>>> Thanks a ton,
>>> Raakhi
>>>
>>> On Wed, Apr 8, 2009 at 12:49 PM, Rakhi Khatwani <rakhi.khatwani@gmail.com> wrote:
>>>
>>>> Hi Lars,
>>>>
>>>> Well the details are as follows:
>>>>
>>>> table1 has the rowkey as some url, and 2 ColumnFamilies as described
>>>> below:
>>>>
>>>> one columnFamily called content and
>>>> one columnFamily called status [which takes the values ANALYSED,
>>>> UNANALYSED] (all in upper case... i checked it, there is no issue with
>>>> the spelling/case).
>>>>
>>>> Hope this helps,
>>>> Thanks.
>>>> Rakhi
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Apr 8, 2009 at 1:59 PM, Lars George <lars@worldlingo.com> wrote:
>>>>
>>>>
>>>>
>>>>         
>>>>> Hi Rakhi,
>>>>>
>>>>> Wow, same here. I copied your RowFilter line and when I press the dot
>>>>> key and the fly-up opens, Eclipse hangs. Nice... NOT!
>>>>>
>>>>> Apart from that, you are also saying that the filter is not working as
>>>>> expected? Do you use any column qualifiers for the "Status:" column? Are
>>>>> the values in the correct casing, i.e. are the values stored in uppercase
>>>>> as you have it in your example below? I assume the comparison is byte
>>>>> sensitive. Please give us more details, maybe a small sample table dump
>>>>> so that we can test this?
>>>>>
>>>>> Lars
>>>>>
>>>>> Rakhi Khatwani wrote:
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>>> Hi,
>>>>>>          I did try the filter... but using ColumnValueFilter. i
>>>>>> declared a ColumnValueFilter as follows:
>>>>>>
>>>>>> public class TableInputFilter extends TableInputFormat
>>>>>>     implements JobConfigurable {
>>>>>>
>>>>>>   public void configure(final JobConf jobConf) {
>>>>>>     setHtable(tablename);
>>>>>>     setInputColumns(columnName);
>>>>>>
>>>>>>     final RowFilterInterface colFilter = new ColumnValueFilter(
>>>>>>         "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
>>>>>>         "UNCOLLECTED".getBytes());
>>>>>>     setRowFilter(colFilter);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> and then i use my class as the input format to my map function.
>>>>>>
>>>>>> in my map function, i set my log to display the value of my Status
>>>>>> Column family.
>>>>>>
>>>>>> when i execute my map reduce function, it displays "Status::
>>>>>> Uncollected" for some rows
>>>>>> and Status = "Collected" for rest of the rows.
>>>>>>
>>>>>> but what i want is to send only those records whose 'Status: is
>>>>>> uncollected'.
>>>>>>
>>>>>> i even considered using the method filterRow described by the API as
>>>>>> follows:
>>>>>>
>>>>>>  boolean filterRow(SortedMap<byte[],Cell> columns)
>>>>>>         Filter on the fully assembled row.
>>>>>>
>>>>>> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/filter/ColumnValueFilter.html#filterRow%28java.util.SortedMap%29
>>>>>>
>>>>>> but as soon as i type colFilter followed by a '.', my eclipse hangs.
>>>>>> its really weird... i have tried it on 3 different machines (2 machines
>>>>>> on linux running eclipse Ganymede 3.4 and one on windows using
>>>>>> myEclipse).
>>>>>>
>>>>>>
>>>>>> i dunno if i am going wrong somewhere
>>>>>>
>>>>>> Thanks,
>>>>>> Raakhi
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 7, 2009 at 7:18 PM, Lars George <lars@worldlingo.com> wrote:
>>>>>>
>>>>>>> Hi Rakhi,
>>>>>>>
>>>>>>> The way the filters work is that you either use the supplied filters
>>>>>>> or create your own subclasses - but then you will have to deploy that
>>>>>>> class to all RegionServers while adding it to their respective
>>>>>>> hbase-env.sh (in the "export HBASE_CLASSPATH" variable). We are
>>>>>>> currently discussing if this could be done dynamically
>>>>>>> (https://issues.apache.org/jira/browse/HBASE-1288).
>>>>>>>
>>>>>>> Once you have that done, or if you use one of the supplied ones, you
>>>>>>> can assign the filter by overriding the TableInputFormat's configure()
>>>>>>> method like so:
>>>>>>>
>>>>>>>  public void configure(JobConf job) {
>>>>>>>   RegExpRowFilter filter = new RegExpRowFilter("ABC.*");
>>>>>>>   setRowFilter(filter);
>>>>>>>  }
>>>>>>>
>>>>>>> As Tim points out, setting the whole thing up is done in your main
>>>>>>> M/R tool based application, similar to:
>>>>>>>
>>>>>>>  JobConf job = new JobConf(...);
>>>>>>>  TableMapReduceUtil.initTableMapJob("<table-name>", "<columns>",
>>>>>>>    IdentityTableMap.class, ImmutableBytesWritable.class,
>>>>>>>    RowResult.class, job);
>>>>>>>  job.setReducerClass(MyTableReduce.class);
>>>>>>>  job.setInputFormat(MyTableInputFormat.class);
>>>>>>>  job.setOutputFormat(MyTableOutputFormat.class);
>>>>>>>
>>>>>>> Of course this depends on what classes you want to replace, or whether
>>>>>>> this is a Reduce oriented job (meaning a default identity + filter map
>>>>>>> with all the work done in the Reduce phase) or the other way around.
>>>>>>> But the principles and filtering are the same.
>>>>>>>
>>>>>>> HTH,
>>>>>>> Lars
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Rakhi Khatwani wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> Thanks Ryan, i will try that
>>>>>>>>
>>>>>>>> On Tue, Apr 7, 2009 at 3:05 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> there is a server-side mechanism to filter rows, it's found in the
>>>>>>>>> org.apache.hadoop.hbase.filter package.  im not sure how this
>>>>>>>>> interops with the TableInputFormat exactly.
>>>>>>>>>
>>>>>>>>> setting a filter to reduce the # of rows returned is pretty much
>>>>>>>>> exactly what you want.
>>>>>>>>>
>>>>>>>>> On Tue, Apr 7, 2009 at 2:26 AM, Rakhi Khatwani <rakhi.khatwani@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>  i have a map reduce program with which i read from a hbase table.
>>>>>>>>>> In my map program i check if the column value of a is xxx, if yes
>>>>>>>>>> then continue with processing else skip it.
>>>>>>>>>> however if my table is really big, most of my time in the map gets
>>>>>>>>>> wasted for processing unwanted rows.
>>>>>>>>>> is there any way through which we could send a subset of rows (based
>>>>>>>>>> on the value of a particular column family) to the map???
>>>>>>>>>>
>>>>>>>>>> i have also gone through TableInputFormatBase but am not able to
>>>>>>>>>> figure out how do we set the input format if we are using
>>>>>>>>>> TableMapReduceUtil class to initialize table map jobs. or is there
>>>>>>>>>> any other way i could use it.
>>>>>>>>>>
>>>>>>>>>> Thanks in Advance,
>>>>>>>>>> Raakhi.