hbase-user mailing list archives

From Lars George <l...@worldlingo.com>
Subject Re: help with map-reduce
Date Thu, 09 Apr 2009 15:58:48 GMT
Hi Rakhi,

The second part was meant to say: "...Setting it to *false* activates
the...", so call it like this:


 final RowFilterInterface colFilter = new ColumnValueFilter(
   "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
   "UNCOLLECTED".getBytes(), false);

Regards,
Lars

PS: And sorry for my misspelling of your name
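
A minimal, self-contained sketch of the encoding caveat raised in the quoted
message below: the filter compares raw bytes, so the value passed to it has to
be encoded the same way as the stored value. The class and method names
(EncodingCheck, utf8, sameBytes) are made up for illustration and are not part
of HBase; the sketch assumes the stored strings were written as UTF-8, which is
what Bytes.toBytes(String) should produce as far as I know.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {

    // Encode with an explicit charset so the result does not depend on the
    // JVM's platform default (which is what String.getBytes() would use).
    public static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Plain byte-for-byte equality, standing in for the byte-level compare
    // that a value filter performs. Hypothetical helper, not HBase API.
    public static boolean sameBytes(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        byte[] stored = utf8("UNCOLLECTED");

        // Same string, same charset: the bytes match.
        System.out.println(sameBytes(stored, utf8("UNCOLLECTED")));

        // Different casing: the bytes differ, so an EQUAL check would fail.
        System.out.println(sameBytes(stored, utf8("Uncollected")));

        // Same string, different charset: the bytes differ as well.
        byte[] utf16 = "UNCOLLECTED".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(sameBytes(stored, utf16));
    }
}
```

The point of routing every conversion through one helper is that the writer
and the filter cannot silently drift apart when the code runs on machines with
different default encodings.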


Lars George wrote:
> Hi Rahki,
>
> Looking through the code of the ColumnValueFilter again, it seems it 
> does what you want when you add the extra "filterIfColumnMissing" 
> parameter to the constructor and set it to "false". The default "true" 
> does the column filtering and will return all rows that have that 
> column. Setting it to true activates the "filterRow()" (although I am 
> not sure yet where that is called - the others I can see in the 
> StoreScanner class in use) to filter rows out that do not have a 
> column match - which is what you want. Of course you still need to 
> invert the check as mentioned in the previous email.
>
> Lars
>
> Rakhi Khatwani wrote:
>> Hi Lars,
>> Hmm... i had a look at other filters.. but i thought ColumnValueFilter
>> would be more appropriate coz in the constructor we could mention the
>> column name and the value.
>> Probably i am going wrong there.
>>
>> what i want is to filter out all the rows based on some column value.
>> what do you suggest?
>>
>> thanks a ton
>> Rakhi
>>
>> On Thu, Apr 9, 2009 at 11:46 AM, Lars George <lars@worldlingo.com> 
>> wrote:
>>
>>  
>>> Hi Rakhi,
>>>
>>> Sorry, not yet. This is not an easy thing to replicate. I will try
>>> though over the next few days if I find time. A few things to note
>>> though first. The way filters work is that they do *not* let filtered
>>> rows through but actually filter them out. That means your logic
>>> seems reversed:
>>>
>>>  final RowFilterInterface colFilter = new ColumnValueFilter(
>>>    "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
>>>    "UNCOLLECTED".getBytes());
>>>  setRowFilter(colFilter);
>>>
>>>
>>> I think you *want* the uncollected columns to be processed? At least
>>> that is what you said below :) So you will have to filter all other
>>> rows out of the set that are NOT EQUAL to "UNCOLLECTED".
>>>
>>> Second, be careful with "UNCOLLECTED".getBytes() as that uses your
>>> system's default encoding. Better use Bytes.toBytes("UNCOLLECTED") -
>>> but that should of course match the way you store those strings in
>>> the first place. The filters do a byte level compare so that is very
>>> sensitive.
>>>
>>> This does not address yet why you see both values or have matches at
>>> all. It rather sounds like the filter is not active?
>>>
>>> And lastly, using the ColumnValueFilter will always let through all
>>> rows! It is designed to strip out the columns of each row, but not
>>> filter on the row itself. Is that what you want? If not you may have
>>> to use a different filter class.
>>>
>>>
>>> Lars
>>>
>>>
>>> Rakhi Khatwani wrote:
>>>
>>>    
>>>> Hi Lars,
>>>> Just wanted to follow up, did you try out the column value filter?
>>>> did it work??
>>>> i really need it to improve the performance of my map-reduce programs.
>>>>
>>>> Thanks a ton,
>>>> Raakhi
>>>>
>>>> On Wed, Apr 8, 2009 at 12:49 PM, Rakhi Khatwani
>>>> <rakhi.khatwani@gmail.com> wrote:
>>>>
>>>>> Hi Lars,
>>>>>
>>>>> Well the details are as follows:
>>>>>
>>>>> table1 has the rowkey as some url, and 2 ColumnFamilies as described
>>>>> below:
>>>>>
>>>>> one columnFamily called content and
>>>>> one columnFamily called status [which takes the values ANALYSED,
>>>>> UNANALYSED] (all in upper case... i checked it, there is no issue
>>>>> with the spelling/case).
>>>>>
>>>>> Hope this helps,
>>>>> Thanks.
>>>>> Rakhi
>>>>>
>>>>> On Wed, Apr 8, 2009 at 1:59 PM, Lars George <lars@worldlingo.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Rakhi,
>>>>>>
>>>>>> Wow, same here. I copied your RowFilter line and when I press the
>>>>>> dot key and the fly up opens Eclipse hangs. Nice... NOT!
>>>>>>
>>>>>> Apart from that, you are also saying that the filter is not working
>>>>>> as expected? Do you use any column qualifiers for the "Status:"
>>>>>> column? Are the values in the correct casing, i.e. are the values
>>>>>> stored in uppercase as you have it in your example below? I assume
>>>>>> the comparison is byte sensitive. Please give us more details, maybe
>>>>>> a small sample table dump so that we can test this?
>>>>>>
>>>>>> Lars
>>>>>>
>>>>>> Rakhi Khatwani wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>          I did try the filter... but using ColumnValueFilter. i
>>>>>>> declared a ColumnValueFilter as follows:
>>>>>>>
>>>>>>> public class TableInputFilter extends TableInputFormat
>>>>>>>   implements JobConfigurable {
>>>>>>>
>>>>>>>   public void configure(final JobConf jobConf) {
>>>>>>>     setHtable(tablename);
>>>>>>>     setInputColumns(columnName);
>>>>>>>
>>>>>>>     final RowFilterInterface colFilter = new ColumnValueFilter(
>>>>>>>       "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
>>>>>>>       "UNCOLLECTED".getBytes());
>>>>>>>     setRowFilter(colFilter);
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> and then i use my class as the input format to my map function.
>>>>>>>
>>>>>>> in my map function, i set my log to display the value of my Status
>>>>>>> Column family.
>>>>>>>
>>>>>>> when i execute my map reduce function, it displays "Status::
>>>>>>> Uncollected" for some rows and Status = "Collected" for rest of the
>>>>>>> rows.
>>>>>>>
>>>>>>> but what i want is to send only those records whose 'Status: is
>>>>>>> uncollected'.
>>>>>>>
>>>>>>> i even considered using the method filterRow described by the API
>>>>>>> as follows:
>>>>>>>
>>>>>>>  boolean filterRow(SortedMap<byte[],Cell> columns)
>>>>>>>         Filter on the fully assembled row.
>>>>>>>
>>>>>>>  filterRow: http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/filter/ColumnValueFilter.html#filterRow%28java.util.SortedMap%29
>>>>>>>  SortedMap: http://java.sun.com/javase/6/docs/api/java/util/SortedMap.html?is-external=true
>>>>>>>  Cell: http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/io/Cell.html
>>>>>>>
>>>>>>> but as soon as i type colFilter followed by a '.', my eclipse
>>>>>>> hangs. it's really weird... i have tried it on 3 different machines
>>>>>>> (2 machines on linux running eclipse Ganymede 3.4 and one on windows
>>>>>>> using myEclipse).
>>>>>>>
>>>>>>>
>>>>>>> i dunno if i am going wrong somewhere
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Raakhi
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 7, 2009 at 7:18 PM, Lars George <lars@worldlingo.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Rakhi,
>>>>>>>>
>>>>>>>> The way the filters work is that you either use the supplied filters
>>>>>>>> or create your own subclasses - but then you will have to deploy
>>>>>>>> that class to all RegionServers while adding it to their respective
>>>>>>>> hbase-env.sh (in the "export HBASE_CLASSPATH" variable). We are
>>>>>>>> currently discussing whether this could be done dynamically
>>>>>>>> (https://issues.apache.org/jira/browse/HBASE-1288).
>>>>>>>>
>>>>>>>> Once you have that done or use one of the supplied ones then you can
>>>>>>>> assign the filter by overriding the TableInputFormat's configure()
>>>>>>>> method and assign it like so:
>>>>>>>>
>>>>>>>>  public void configure(JobConf job) {
>>>>>>>>   RegExpRowFilter filter = new RegExpRowFilter("ABC.*");
>>>>>>>>   setRowFilter(filter);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> As Tim points out, setting the whole thing up is done in your main
>>>>>>>> M/R tool based application, similar to:
>>>>>>>>
>>>>>>>>  JobConf job = new JobConf(...);
>>>>>>>>  TableMapReduceUtil.initTableMapJob("<table-name>", "<columns>",
>>>>>>>>    IdentityTableMap.class, ImmutableBytesWritable.class,
>>>>>>>>    RowResult.class, job);
>>>>>>>>  job.setReducerClass(MyTableReduce.class);
>>>>>>>>  job.setInputFormat(MyTableInputFormat.class);
>>>>>>>>  job.setOutputFormat(MyTableOutputFormat.class);
>>>>>>>>
>>>>>>>> Of course depending on what classes you want to replace or if this
>>>>>>>> is a Reduce oriented job (means a default identity + filter map and
>>>>>>>> all the work done in the Reduce phase) or the other way around. But
>>>>>>>> the principles and filtering are the same.
>>>>>>>>
>>>>>>>> HTH,
>>>>>>>> Lars
>>>>>>>>
>>>>>>>>
>>>>>>>> Rakhi Khatwani wrote:
>>>>>>>>
>>>>>>>>> Thanks Ryan, i will try that
>>>>>>>>>
>>>>>>>>> On Tue, Apr 7, 2009 at 3:05 PM, Ryan Rawson <ryanobjc@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> there is a server-side mechanism to filter rows, it's found in the
>>>>>>>>>> org.apache.hadoop.hbase.filter package.  i'm not sure how this
>>>>>>>>>> interops with the TableInputFormat exactly.
>>>>>>>>>>
>>>>>>>>>> setting a filter to reduce the # of rows returned is pretty much
>>>>>>>>>> exactly what you want.
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 7, 2009 at 2:26 AM, Rakhi Khatwani
>>>>>>>>>> <rakhi.khatwani@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>    Hi,
>>>>>>>>>>>  i have a map reduce program with which i read from a hbase table.
>>>>>>>>>>> In my map program i check if the column value of a is xxx, if yes
>>>>>>>>>>> then continue with processing else skip it.
>>>>>>>>>>> however if my table is really big, most of my time in the map gets
>>>>>>>>>>> wasted for processing unwanted rows.
>>>>>>>>>>> is there any way through which we could send a subset of rows
>>>>>>>>>>> (based on the value of a particular column family) to the map???
>>>>>>>>>>>
>>>>>>>>>>> i have also gone through TableInputFormatBase but am not able to
>>>>>>>>>>> figure out how do we set the input format if we are using
>>>>>>>>>>> TableMapReduceUtil class to initialize table map jobs. or is there
>>>>>>>>>>> any other way i could use it.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in Advance,
>>>>>>>>>>> Raakhi.