hbase-user mailing list archives

From tigertail <tyc...@yahoo.com>
Subject Re: How to read a subset of records based on a column value in a M/R job?
Date Thu, 18 Dec 2008 19:56:03 GMT

FYI, I also tried implementing my own subclass of TableInputFormat, and in
its configure method I called:

    byte [] colName = Bytes.toBytes("f1:age");
    byte [] colValue = Bytes.toBytes("30");
    ColumnValueFilter filter = new ColumnValueFilter(colName,
        ColumnValueFilter.CompareOp.EQUAL, colValue);
    setRowFilter(filter);

But as I said in my first post, it seems to be even slower than reading all
the rows.
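
For reference, here is roughly how I wire that input format into the job
driver. This is just a sketch against the 0.18 mapred API; MyDriver,
MyTableMap, and the "people" table name are placeholders:

    JobConf job = new JobConf(new HBaseConfiguration(), MyDriver.class);
    job.setInputFormat(TableInputFilterFormat.class);
    // TableInputFormat takes the table name from the job's input path ...
    FileInputFormat.setInputPaths(job, new Path("people"));
    // ... and the columns to scan from COLUMN_LIST
    job.set(TableInputFormat.COLUMN_LIST, "f1:age");
    job.setMapperClass(MyTableMap.class);
    JobClient.runJob(job);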


tigertail wrote:
> 
> Erik,
> 
> As far as I know, the column filtering happens in TableInputFormatBase. We
> can use setInputColums to specify the columns we want returned, and
> TableInputFormatBase will then open a scanner over those columns.
> 
> Yes, we can use "age" as the family and each age value as a column
> qualifier. But can that avoid reading every row, which is what the
> following code ends up doing?
> 
>   public void map(ImmutableBytesWritable row, RowResult value,
>       OutputCollector<Text, Text> output,
>       @SuppressWarnings("unused") Reporter reporter)
>   throws IOException {
> 
>     Cell cell = value.get("age:30".getBytes());
>     if (cell == null) {
>       return;
>     }
>     ...
>   }
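> 
> (If nothing else, the driver can at least restrict the scan to that single
> column; a sketch against the 0.18 API, where the job setup would be:
> 
>     job.set(TableInputFormat.COLUMN_LIST, "age:30");
> 
> But whether the scanner then skips rows that lack age:30 entirely, rather
> than handing my map an empty row, is exactly what I am not sure about.)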
> 
> Erik Holstad wrote:
>> 
>> Hi Tigertail!
>> I have written some MR jobs before, but nothing fancy like implementing
>> my own filter the way you have. What I do know is that you can specify
>> the columns you want to read as the input to the map task. But since I'm
>> not sure how that filter process is handled internally, I can't say
>> whether it reads in all the columns and then filters them out, or how it
>> actually does it. Please let me know how it works, you people out there
>> who have this knowledge :).
>> 
>> But you could try having a column family age: and then one column for
>> every age you want to be able to specify, for example age:30, so you
>> don't have to look at the value of the column but rather use the column
>> itself as the key.
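>> 
>> A quick sketch of what a write looks like with that schema (0.18 client
>> API; the table and row key are made up, and the cell value can be empty
>> since the qualifier itself carries the age):
>> 
>>     // the row key is the person's name; the age lives in the qualifier
>>     BatchUpdate update = new BatchUpdate("john doe");
>>     update.put("age:30", HConstants.EMPTY_BYTE_ARRAY);
>>     table.commit(update);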
>> 
>> Hope that helped you a little bit, and please let me know what kind of
>> results you come up with.
>> 
>> Regards Erik
>> 
>> On Thu, Dec 18, 2008 at 9:26 AM, tigertail <tyczjs@yahoo.com> wrote:
>> 
>>>
>>> Thanks Erik,
>>>
>>> What I want is to quickly return a small subset, either by row key
>>> values or by a specific value in a column, without reading all records
>>> into the mapper. So I actually have two questions :)
>>>
>>> For the column-based search: for example, I have 1 billion people
>>> records in the table, the row key is the "name", and there is an "age"
>>> column. Now I want to find the records with age=30. How can I avoid
>>> reading every record into the mapper and then filtering the output?
>>>
>>> For searching by row key values, let's suppose I have 1 million
>>> people's names. Is there a more efficient way than running
>>> table.getRow(name) 1 million times, given that the "name" strings are
>>> randomly distributed (and hence it is useless to write a new
>>> getSplits)?
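>>> 
>>> (What I do today is essentially the following in the map, with the file
>>> of names as the job input -- a sketch, where "table" is an HTable opened
>>> in configure():
>>> 
>>>     public void map(LongWritable key, Text name,
>>>         OutputCollector<Text, Text> output, Reporter reporter)
>>>     throws IOException {
>>>       // one random read per known row key -- this is the slow path
>>>       RowResult row = table.getRow(Bytes.toBytes(name.toString()));
>>>       if (row != null && !row.isEmpty()) {
>>>         output.collect(name, new Text(row.toString()));
>>>       }
>>>     }
>>> 
>>> so anything that avoids 1 million of these random reads would help.)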
>>>
>>> >> Did you try to only put that column in there for the rows that you
>>> >> want to get and use that as an input to the MR?
>>>
>>> I am not sure I get you there. I can use
>>> TableInputFormatBase.setInputColums in my program to return only the
>>> "age" column, but I still need to read every row from the table into
>>> the mapper. Or is my understanding wrong? Can you give more details on
>>> your thought?
>>>
>>> Thanks again.
>>>
>>>
>>>
>>> Erik Holstad wrote:
>>> >
>>> > Hi Tigertail!
>>> > Not sure if I understand your original problem correctly, but it
>>> > seemed to me that you wanted to just get the rows with the value 1 in
>>> > a column, right?
>>> >
>>> > Did you try to only put that column in there for the rows that you
>>> > want to get and use that as an input to the MR?
>>> >
>>> > I haven't timed my MR jobs with this approach so I'm not sure how it
>>> > is handled internally, but maybe it is worth giving it a try.
>>> >
>>> > Regards Erik
>>> >
>>> > On Wed, Dec 17, 2008 at 8:37 PM, tigertail <tyczjs@yahoo.com> wrote:
>>> >
>>> >>
>>> >> Hi St. Ack,
>>> >>
>>> >> Thanks for your input. I ran 32 map tasks (I have 8 boxes, each with
>>> >> 4 CPUs). Suppose the 1M row keys are known beforehand and saved in a
>>> >> file; I just read each key into a mapper and use table.getRow(key) to
>>> >> get the record. I also tried increasing the number of map tasks, but
>>> >> it did not improve the performance; actually, it made things worse.
>>> >> Many tasks failed or were killed with something like "no response in
>>> >> 600 seconds."
>>> >>
>>> >>
>>> >> stack-3 wrote:
>>> >> >
>>> >> > For A2. below, how many map tasks? How did you split the 1M you
>>> >> > wanted to fetch? How many of them ran concurrently?
>>> >> > St.Ack
>>> >> >
>>> >> >
>>> >> > tigertail wrote:
>>> >> >> Hi, can anybody help? Hopefully the following can help make my
>>> >> >> question clear if it was not in my last post.
>>> >> >>
>>> >> >> A1. I created a table in HBase and then inserted 10 million
>>> >> >> records into the table.
>>> >> >> A2. I ran an M/R program with a total of 10 million "get by
>>> >> >> rowkey" operations to read the 10M records out, and it took about
>>> >> >> 3 hours to finish.
>>> >> >> A3. I also ran an M/R program which used TableMap to read the 10M
>>> >> >> records out, and it took just 12 minutes.
>>> >> >>
>>> >> >> Now suppose I only need to read 1 million records whose row keys
>>> >> >> are known beforehand (and let's suppose the worst case: the 1M
>>> >> >> records are evenly distributed among the 10M records).
>>> >> >>
>>> >> >> S1. I can use 1M "get by rowkey" operations. But that is slow.
>>> >> >> S2. I can also simply use TableMap and output only the 1M records
>>> >> >> in the map function, but that actually reads the whole table.
>>> >> >>
>>> >> >> Q1. Is there some more efficient way to read the 1M records,
>>> >> >> WITHOUT PASSING THROUGH THE WHOLE TABLE?
>>> >> >>
>>> >> >> How about if I have 1 billion records in an HBase table and I
>>> >> >> only need to read 1 million records, in the following two
>>> >> >> scenarios?
>>> >> >>
>>> >> >> Q2. Suppose their row keys are known beforehand.
>>> >> >> Q3. Or suppose these 1 million records have the same value in a
>>> >> >> column.
>>> >> >>
>>> >> >> Any input would be greatly appreciated. Thank you so much!
>>> >> >>
>>> >> >>
>>> >> >> tigertail wrote:
>>> >> >>
>>> >> >>> For example, I have an HBase table with 1 billion records. Each
>>> >> >>> record has a column named 'f1:testcol'. And I want to get only
>>> >> >>> the records with 'f1:testcol'=0 as the input to my map function.
>>> >> >>> Supposing there are 1 million such records, I would expect this
>>> >> >>> to be much faster than getting all 1 billion records into my map
>>> >> >>> function and then doing the condition check.
>>> >> >>>
>>> >> >>> By searching on this board and the HBase documents, I tried to
>>> >> >>> implement my own subclass of TableInputFormat and set a
>>> >> >>> ColumnValueFilter in the configure method.
>>> >> >>>
>>> >> >>> public class TableInputFilterFormat extends TableInputFormat
>>> >> >>>     implements JobConfigurable {
>>> >> >>>   private final Log LOG =
>>> >> >>>       LogFactory.getLog(TableInputFilterFormat.class);
>>> >> >>>
>>> >> >>>   public static final String FILTER_LIST =
>>> >> >>>       "hbase.mapred.tablefilters";
>>> >> >>>
>>> >> >>>   public void configure(JobConf job) {
>>> >> >>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
>>> >> >>>
>>> >> >>>     String colArg = job.get(COLUMN_LIST);
>>> >> >>>     String[] colNames = colArg.split(" ");
>>> >> >>>     byte[][] m_cols = new byte[colNames.length][];
>>> >> >>>     for (int i = 0; i < m_cols.length; i++) {
>>> >> >>>       m_cols[i] = Bytes.toBytes(colNames[i]);
>>> >> >>>     }
>>> >> >>>     setInputColums(m_cols);
>>> >> >>>
>>> >> >>>     ColumnValueFilter filter = new ColumnValueFilter(
>>> >> >>>         Bytes.toBytes("f1:testcol"),
>>> >> >>>         ColumnValueFilter.CompareOp.EQUAL,
>>> >> >>>         Bytes.toBytes("0"));
>>> >> >>>     setRowFilter(filter);
>>> >> >>>
>>> >> >>>     try {
>>> >> >>>       setHTable(new HTable(new HBaseConfiguration(job),
>>> >> >>>           tableNames[0].getName()));
>>> >> >>>     } catch (Exception e) {
>>> >> >>>       LOG.error(e);
>>> >> >>>     }
>>> >> >>>   }
>>> >> >>> }
>>> >> >>>
>>> >> >>> However, the M/R job with the RowFilter is much slower than the
>>> >> >>> M/R job without the RowFilter. During the run, many tasks failed
>>> >> >>> with something like "Task attempt_200812091733_0063_m_000019_1
>>> >> >>> failed to report status for 604 seconds. Killing!". I am
>>> >> >>> wondering whether a RowFilter can really cut the records fed to
>>> >> >>> the map function from 1 billion down to 1 million. If it cannot,
>>> >> >>> is there any other method to address this issue?
>>> >> >>>
>>> >> >>> I am using Hadoop 0.18.2 and HBase 0.18.1.
>>> >> >>>
>>> >> >>> Thank you so much in advance!
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>> 
>> 
> 
> 


