hbase-user mailing list archives

From Rakhi Khatwani <rakhi.khatw...@gmail.com>
Subject Re: help with map-reduce
Date Thu, 09 Apr 2009 16:12:14 GMT
Hi Lars,
thanks for your suggestion... i will try this out today :)

thanks once again
Rakhi

On Thu, Apr 9, 2009 at 6:58 PM, Lars George <lars@worldlingo.com> wrote:

> Hi Rakhi,
>
> The second part was meant to say: "...Setting it to *false* activates
> the...", so call it like this:
>
> final RowFilterInterface colFilter = new ColumnValueFilter(
>     "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
>     "UNCOLLECTED".getBytes(), false);
>
> Regards,
> Lars
>
> PS: And sorry for my misspelling of your name
>
>
>
> Lars George wrote:
>
>> Hi Rahki,
>>
>> Looking through the code of the ColumnValueFilter again, it seems it does
>> what you want when you add the extra "filterIfColumnMissing" parameter to
>> the constructor and set it to "false". The default "true" does the column
>> filtering and will return all rows that have that column. Setting it to true
>> activates the "filterRow()" (although I am not sure yet where that is called
>> - the others I can see in the StoreScanner class in use) to filter rows out
>> that do not have a column match - which is what you want. Of course you
>> still need to invert the check as mentioned in the previous email.
>>
>> Lars
>>
>> Rakhi Khatwani wrote:
>>
>>> Hi Lars,
>>> Hmm... i had a look at the other filters, but i thought ColumnValueFilter
>>> would be more appropriate because in the constructor we could mention the
>>> column name and the value. Probably i am going wrong there.
>>>
>>> what i want is to filter out all the rows based on some column value.
>>> what do you suggest?
>>>
>>> thanks a ton
>>> Rakhi
>>>
>>> On Thu, Apr 9, 2009 at 11:46 AM, Lars George <lars@worldlingo.com> wrote:
>>>
>>>> Hi Rakhi,
>>>>
>>>> Sorry, not yet. This is not an easy thing to replicate. I will try though
>>>> over the next few days if I find time. A few things to note first, though.
>>>> The way filters work is that they do *not* let filtered rows through but
>>>> actually filter them out. That means your logic seems reversed:
>>>>
>>>>  final RowFilterInterface colFilter = new ColumnValueFilter(
>>>>      "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
>>>>      "UNCOLLECTED".getBytes());
>>>>  setRowFilter(colFilter);
>>>>
>>>> I think you *want* the uncollected columns to be processed? At least that
>>>> is what you said below :) So you will have to filter out of the set all
>>>> other rows, i.e. those that are NOT EQUAL to "UNCOLLECTED".
>>>>
>>>> Second, be careful with "UNCOLLECTED".getBytes() as that uses your
>>>> system's default encoding. Better use Bytes.toBytes("UNCOLLECTED") - but
>>>> that should of course match the way you store those strings in the first
>>>> place. The filters do a byte level compare so that is very sensitive.
>>>>
>>>> This does not address yet why you see both values or have matches at all.
>>>> It rather sounds like the filter is not active?
>>>>
>>>> And lastly, using the ColumnValueFilter will always let through all rows!
>>>> It is designed to strip out the columns of each row, but not filter on the
>>>> row itself. Is that what you want? If not you may have to use a different
>>>> filter class.
>>>>
>>>>
>>>> Lars
>>>>
>>>>
>>>> Rakhi Khatwani wrote:
>>>>
>>>>
>>>>
>>>>> Hi Lars,
>>>>>             Just wanted to follow up, did you try out the column value
>>>>> filter? did it work??
>>>>> i really need it to improve the performance of my map-reduce programs.
>>>>>
>>>>> Thanks a ton,
>>>>> Raakhi
>>>>>
>>>>> On Wed, Apr 8, 2009 at 12:49 PM, Rakhi Khatwani <rakhi.khatwani@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Lars,
>>>>>>
>>>>>> Well the details are as follows:
>>>>>>
>>>>>> table1 has the rowkey as some url, and 2 ColumnFamilies as described
>>>>>> below:
>>>>>>
>>>>>> one columnFamily called content and
>>>>>> one columnFamily called status [which takes the values ANALYSED,
>>>>>> UNANALYSED] (all in upper case... i checked it, there is no issue with
>>>>>> the spelling/case).
>>>>>>
>>>>>> Hope this helps,
>>>>>> Thanks.
>>>>>> Rakhi
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 8, 2009 at 1:59 PM, Lars George <lars@worldlingo.com> wrote:
>>>>>>
>>>>>>> Hi Rakhi,
>>>>>>>
>>>>>>> Wow, same here. I copied your RowFilter line and when I press the dot key
>>>>>>> and the fly up opens Eclipse hangs. Nice... NOT!
>>>>>>>
>>>>>>> Apart from that, you are also saying that the filter is not working as
>>>>>>> expected? Do you use any column qualifiers for the "Status:" column? Are
>>>>>>> the values in the correct casing, i.e. are the values stored in uppercase
>>>>>>> as you have it in your example below? I assume the comparison is byte
>>>>>>> sensitive. Please give us more details, maybe a small sample table dump so
>>>>>>> that we can test this?
>>>>>>>
>>>>>>> Lars
>>>>>>>
>>>>>>> Rakhi Khatwani wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I did try the filter... but using ColumnValueFilter. i declared a
>>>>>>>> ColumnValueFilter as follows:
>>>>>>>>
>>>>>>>> public class TableInputFilter extends TableInputFormat
>>>>>>>>     implements JobConfigurable {
>>>>>>>>
>>>>>>>>   public void configure(final JobConf jobConf) {
>>>>>>>>     setHtable(tablename);
>>>>>>>>     setInputColumns(columnName);
>>>>>>>>
>>>>>>>>     final RowFilterInterface colFilter = new ColumnValueFilter(
>>>>>>>>         "Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
>>>>>>>>         "UNCOLLECTED".getBytes());
>>>>>>>>     setRowFilter(colFilter);
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> and then i use my class as the input format to my map function.
>>>>>>>>
>>>>>>>>
>>>>>>>> in my map function, i set my log to display the value of my Status Column
>>>>>>>> family.
>>>>>>>>
>>>>>>>> when i execute my map reduce function, it displays "Status:: Uncollected"
>>>>>>>> for some rows and Status = "Collected" for rest of the rows.
>>>>>>>>
>>>>>>>> but what i want is to send only those records whose 'Status: is
>>>>>>>> uncollected'.
>>>>>>>>
>>>>>>>> i even considered using the method filterRow described by the API as
>>>>>>>> follows:
>>>>>>>>  boolean filterRow(SortedMap<byte[],Cell> columns)
>>>>>>>>        Filter on the fully assembled row.
>>>>>>>>
>>>>>>>> (http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/filter/ColumnValueFilter.html#filterRow%28java.util.SortedMap%29)
>>>>>>>>
>>>>>>>> but as soon as i type colFilter followed by a '.', my eclipse hangs.
>>>>>>>> its really weird... i have tried it on 3 different machines (2 machines on
>>>>>>>> linux running eclipse Ganymede 3.4 and one on windows using myEclipse).
>>>>>>>>
>>>>>>>>
>>>>>>>> i dunno if i am going wrong somewhere
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Raakhi
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 7, 2009 at 7:18 PM, Lars George <lars@worldlingo.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Rakhi,
>>>>>>>>>
>>>>>>>>> The way the filters work is that you either use the supplied filters
>>>>>>>>> or create your own subclasses - but then you will have to deploy that
>>>>>>>>> class to all RegionServers while adding it to their respective
>>>>>>>>> hbase-env.sh (in the "export HBASE_CLASSPATH" variable). We are
>>>>>>>>> currently discussing if this could be done dynamically
>>>>>>>>> (https://issues.apache.org/jira/browse/HBASE-1288).
>>>>>>>>>
>>>>>>>>> Once you have that done, or if you use one of the supplied ones, you
>>>>>>>>> can assign the filter by overriding the TableInputFormat's configure()
>>>>>>>>> method like so:
>>>>>>>>>
>>>>>>>>>  public void configure(JobConf job) {
>>>>>>>>>    RegExpRowFilter filter = new RegExpRowFilter("ABC.*");
>>>>>>>>>    setRowFilter(filter);
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> As Tim points out, setting the whole thing up is done in your main M/R
>>>>>>>>> tool based application, similar to:
>>>>>>>>>
>>>>>>>>>  JobConf job = new JobConf(...);
>>>>>>>>>  TableMapReduceUtil.initTableMapJob("<table-name>", "<columns>",
>>>>>>>>>      IdentityTableMap.class, ImmutableBytesWritable.class,
>>>>>>>>>      RowResult.class, job);
>>>>>>>>>  job.setReducerClass(MyTableReduce.class);
>>>>>>>>>  job.setInputFormat(MyTableInputFormat.class);
>>>>>>>>>  job.setOutputFormat(MyTableOutputFormat.class);
>>>>>>>>>
>>>>>>>>> Of course this depends on what classes you want to replace, and on
>>>>>>>>> whether this is a Reduce oriented job (meaning a default identity +
>>>>>>>>> filter map and all the work done in the Reduce phase) or the other way
>>>>>>>>> around. But the principles and filtering are the same.
>>>>>>>>>
>>>>>>>>> HTH,
>>>>>>>>> Lars
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Rakhi Khatwani wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks Ryan, i will try that
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 7, 2009 at 3:05 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> there is a server-side mechanism to filter rows, it's found in the
>>>>>>>>>>> org.apache.hadoop.hbase.filter package.  im not sure how this interops
>>>>>>>>>>> with the TableInputFormat exactly.
>>>>>>>>>>>
>>>>>>>>>>> setting a filter to reduce the # of rows returned is pretty much exactly
>>>>>>>>>>> what you want.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 7, 2009 at 2:26 AM, Rakhi Khatwani <rakhi.khatwani@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> i have a map reduce program with which i read from a hbase table.
>>>>>>>>>>>> In my map program i check if the column value of a is xxx, if yes then
>>>>>>>>>>>> continue with processing else skip it.
>>>>>>>>>>>> however if my table is really big, most of my time in the map gets
>>>>>>>>>>>> wasted for processing unwanted rows.
>>>>>>>>>>>> is there any way through which we could send a subset of rows (based
>>>>>>>>>>>> on the value of a particular column family) to the map???
>>>>>>>>>>>>
>>>>>>>>>>>> i have also gone through TableInputFormatBase but am not able to
>>>>>>>>>>>> figure out how do we set the input format if we are using
>>>>>>>>>>>> TableMapReduceUtil class to initialize table map jobs. or is there any
>>>>>>>>>>>> other way i could use it.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in Advance,
>>>>>>>>>>>> Raakhi.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>
>>>>
>>>
>>>
>>
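
Lars's point in the thread above, that a filter removes the rows it matches rather than letting them through, can be sketched without any HBase classes. This is a toy model only: DropFilter, dropIfEqual, dropIfNotEqual, scan and the three sample rows are hypothetical names made up for illustration and are not part of the HBase API. Treating a row with no "Status:" column as "not equal" is likewise an assumption of this sketch, not a statement about how ColumnValueFilter behaves.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterSemanticsSketch {

    /** Hypothetical stand-in for a row filter: returning true DROPS the row. */
    public interface DropFilter {
        boolean filterRow(Map<String, byte[]> columns);
    }

    /** Drops every row whose column holds exactly the given value. */
    public static DropFilter dropIfEqual(String column, String value) {
        byte[] wanted = value.getBytes(StandardCharsets.UTF_8);
        return columns -> Arrays.equals(columns.get(column), wanted);
    }

    /** Drops every row whose column is missing or holds a different value. */
    public static DropFilter dropIfNotEqual(String column, String value) {
        byte[] wanted = value.getBytes(StandardCharsets.UTF_8);
        return columns -> !Arrays.equals(columns.get(column), wanted);
    }

    /** Returns the keys of the rows that survive the filter, in table order. */
    public static List<String> scan(Map<String, Map<String, byte[]>> table,
                                    DropFilter filter) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Map<String, byte[]>> row : table.entrySet()) {
            if (!filter.filterRow(row.getValue())) {
                kept.add(row.getKey());
            }
        }
        return kept;
    }

    /** Three made-up rows: one UNCOLLECTED, one COLLECTED, one without a status. */
    public static Map<String, Map<String, byte[]>> sampleTable() {
        Map<String, Map<String, byte[]>> table = new LinkedHashMap<>();
        table.put("row1", cell("Status:", "UNCOLLECTED"));
        table.put("row2", cell("Status:", "COLLECTED"));
        table.put("row3", cell("content:", "some page text"));
        return table;
    }

    private static Map<String, byte[]> cell(String column, String value) {
        Map<String, byte[]> columns = new LinkedHashMap<>();
        columns.put(column, value.getBytes(StandardCharsets.UTF_8));
        return columns;
    }

    public static void main(String[] args) {
        Map<String, Map<String, byte[]>> table = sampleTable();
        // An EQUAL-style check removes exactly the rows the job wants to see.
        System.out.println(scan(table, dropIfEqual("Status:", "UNCOLLECTED")));
        // Inverting the check keeps only the UNCOLLECTED row.
        System.out.println(scan(table, dropIfNotEqual("Status:", "UNCOLLECTED")));
    }
}
```

In this model the EQUAL-style filter leaves row2 and row3 for the map, while the inverted check leaves only row1, which is the subset asked for in the original question.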

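The warning in the thread about "UNCOLLECTED".getBytes() depending on the system's default encoding can be checked with the JDK alone. The sketch below assumes only standard library classes; EncodingCheck and encode are hypothetical names, and the accented sample string is made up to show a case where two charsets disagree. As far as I know Bytes.toBytes(String) encodes as UTF-8, which is why UTF-8 is used as the reference here, but that is worth confirming against the HBase version in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {

    /** Encodes with an explicit charset, so the result is the same on every JVM. */
    public static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        String ascii = "UNCOLLECTED";
        String accented = "caf\u00e9"; // made-up value containing a non-ASCII character

        // Pure ASCII text encodes to the same bytes in both charsets.
        System.out.println(Arrays.equals(
                encode(ascii, StandardCharsets.UTF_8),
                encode(ascii, StandardCharsets.ISO_8859_1)));

        // Non-ASCII text does not, so a byte-level compare would fail.
        System.out.println(Arrays.equals(
                encode(accented, StandardCharsets.UTF_8),
                encode(accented, StandardCharsets.ISO_8859_1)));

        // What a bare getBytes() would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

For a plain ASCII value such as "UNCOLLECTED" the two encodings agree, so the default-encoding issue is unlikely to explain the behaviour seen in this thread by itself; it matters once stored values contain non-ASCII characters or the writer and reader run with different defaults.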