hbase-user mailing list archives

From tigertail <tyc...@yahoo.com>
Subject Re: How to read a subset of records based on a column value in a M/R job?
Date Thu, 18 Dec 2008 19:49:15 GMT

Erik,

As far as I know, the column selection happens in TableInputFormatBase: we
can use setInputColums to specify the columns we want returned, and
TableInputFormatBase will then open a scanner over just those columns.

Yes, we could use "age" as the family and each age value as a column name.
But can that approach avoid reading all rows, which is what the following
code ends up doing?

  // This runs over every row in the table; rows without an age:30
  // column are filtered out here in the mapper, after they have
  // already been read from the region servers.
  public void map(ImmutableBytesWritable row, RowResult value,
      OutputCollector<Text, Text> output,
      @SuppressWarnings("unused") Reporter reporter) throws IOException {

    Cell cell = value.get(Bytes.toBytes("age:30"));
    if (cell == null) {
      return; // not age 30; skip this row
    }
    ...
  }
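
For reference, a driver for this kind of scan might look roughly like the
following (a sketch against the 0.18-era mapred API; the "people" table
name, the output path, and the AgeMap/AgeScanDriver class names are
placeholders):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.mapred.TableInputFormat;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class AgeScanDriver {
    public static void main(String[] args) throws Exception {
      JobConf job = new JobConf(AgeScanDriver.class);
      job.setJobName("age-scan");

      // TableInputFormat takes the table name as the input "path" and
      // the scanner columns from the COLUMN_LIST job property.
      FileInputFormat.setInputPaths(job, new Path("people"));
      job.set(TableInputFormat.COLUMN_LIST, "age:");
      job.setInputFormat(TableInputFormat.class);

      job.setMapperClass(AgeMap.class);  // the map() above (placeholder name)
      job.setNumReduceTasks(0);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileOutputFormat.setOutputPath(job, new Path("/tmp/age-scan-out"));

      JobClient.runJob(job);
    }
  }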

Erik Holstad wrote:
> 
> Hi Tigertail!
> I have written some MR jobs before, but nothing fancy like implementing
> your own filter the way you have. What I do know is that you can
> specify the columns you want to read as the input to the map task. But
> since I'm not sure how that filtering is handled internally, I can't
> say whether it reads in all the columns and then filters them out, or
> how it actually works; please let me know how it works, you people out
> there who have this knowledge :).
>
> But you could try having a column family age: and then one column for
> every age that you want to be able to query, for example age:30, so
> you don't have to look at the value of the column but instead use the
> column name itself as the key.
>
> Hope that helps a little, and please let me know what kind of results
> you come up with.
>
> Regards Erik
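
If we went that route, populating the marker columns could look something
like this (a sketch; the "people" table name and row key are made up, and
the empty value is deliberate since the column name itself carries the
age):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.io.BatchUpdate;

  public class MarkerColumnWriter {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "people");
      BatchUpdate update = new BatchUpdate("john-smith");  // placeholder row key
      // Empty value: the column *name* carries the age, so a scan on
      // age:30 selects exactly the rows we want.
      update.put("age:30", new byte[0]);
      table.commit(update);
    }
  }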
> 
> On Thu, Dec 18, 2008 at 9:26 AM, tigertail <tyczjs@yahoo.com> wrote:
> 
>>
>> Thanks Erik,
>>
>> What I want is, either by row key values or by a specific value in a
>> column, to quickly return a small subset without reading all records
>> into the mapper. So I actually have two questions :)
>>
>> For the column-based search: for example, I have 1 billion people
>> records in the table, the row key is the "name", and there is an "age"
>> column. Now I want to find the records with age=30. How can I avoid
>> reading every record into the mapper and then filtering the output?
>>
>> For searching by row key values: let's suppose I have 1 million
>> people's names. Is there a more efficient way than running 1 million
>> table.getRow(name) calls, given that the "name" strings are randomly
>> distributed (and hence it is useless to write a new getSplits)?
>>
>> >> Did you try to only put that column in there for the rows that you
>> >> want to get and use that as an input to the MR?
>>
>> I am not sure I follow you there. I can use
>> TableInputFormatBase.setInputColums in my program to only return the
>> "age" column, but still, I need to read every row of the table into
>> the mapper. Or is my understanding wrong? Can you give more details on
>> your idea?
>>
>> Thanks again.
>>
>>
>>
>> Erik Holstad wrote:
>> >
>> > Hi Tigertail!
>> > Not sure if I understand your original problem correctly, but it
>> > seemed to me that you just wanted to get the rows with the value 1
>> > in a column, right?
>> >
>> > Did you try putting that column in only for the rows that you want
>> > to get, and using that as the input to the MR?
>> >
>> > I haven't timed my MR jobs with this approach, so I'm not sure how
>> > it is handled internally, but maybe it is worth a try.
>> >
>> > Regards Erik
>> >
>> > On Wed, Dec 17, 2008 at 8:37 PM, tigertail <tyczjs@yahoo.com> wrote:
>> >
>> >>
>> >> Hi St. Ack,
>> >>
>> >> Thanks for your input. I ran 32 map tasks (I have 8 boxes, each
>> >> with 4 CPUs). The 1M row keys are known beforehand and saved in a
>> >> file; I just read each key into a mapper and use table.getRow(key)
>> >> to fetch the record. I also tried increasing the number of map
>> >> tasks, but it did not improve the performance. Actually it got
>> >> worse: many tasks failed or were killed with something like "no
>> >> response in 600 seconds."
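>> >>
>> >> The mapper is basically this (a sketch; "table" is an HTable field
>> >> opened in configure(), omitted here):
>> >>
>> >>   // One random read per input key; this is what makes it slow.
>> >>   public void map(LongWritable offset, Text line,
>> >>       OutputCollector<Text, Text> output, Reporter reporter)
>> >>       throws IOException {
>> >>     RowResult row = table.getRow(Bytes.toBytes(line.toString()));
>> >>     if (row != null && !row.isEmpty()) {
>> >>       output.collect(line, new Text(row.toString()));
>> >>     }
>> >>   }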
>> >>
>> >>
>> >> stack-3 wrote:
>> >> >
>> >> > For A2 below, how many map tasks? How did you split the 1M you
>> >> > wanted to fetch? How many of them ran concurrently?
>> >> > St.Ack
>> >> >
>> >> >
>> >> > tigertail wrote:
>> >> >> Hi, can anybody help? Hopefully the following makes my question
>> >> >> clearer than my last post did.
>> >> >>
>> >> >> A1. I created a table in HBase and then inserted 10 million
>> >> >> records into the table.
>> >> >> A2. I ran a M/R program with a total of 10 million "get by row
>> >> >> key" operations to read the 10M records out, and it took about 3
>> >> >> hours to finish.
>> >> >> A3. I also ran a M/R program which used TableMap to read the 10M
>> >> >> records out, and it took just 12 minutes.
>> >> >>
>> >> >> Now suppose I only need to read 1 million records whose row keys
>> >> >> are known beforehand (and let's assume the worst case: the 1M
>> >> >> records are evenly distributed among the 10M records).
>> >> >>
>> >> >> S1. I can use 1M "get by row key" calls. But that is slow.
>> >> >> S2. I can also simply use TableMap and only output the 1M
>> >> >> matching records in the map function, but that actually reads
>> >> >> the whole table.
>> >> >>
>> >> >> Q1. Is there some more efficient way to read the 1M records,
>> >> >> WITHOUT PASSING THROUGH THE WHOLE TABLE?
>> >> >>
>> >> >> How about if I have 1 billion records in an HBase table and I
>> >> >> only need to read 1 million records, in the following two
>> >> >> scenarios:
>> >> >>
>> >> >> Q2. suppose their row keys are known beforehand
>> >> >> Q3. or suppose these 1 million records have the same value in a
>> >> >> column
>> >> >>
>> >> >> Any input would be greatly appreciated. Thank you so much!
>> >> >>
>> >> >>
>> >> >> tigertail wrote:
>> >> >>
>> >> >>> For example, I have an HBase table with 1 billion records. Each
>> >> >>> record has a column named 'f1:testcol', and I want to get only
>> >> >>> the records with 'f1:testcol'=0 as the input to my map
>> >> >>> function. Suppose there are 1 million such records; I would
>> >> >>> expect this to be much faster than getting all 1 billion
>> >> >>> records into my map function and then doing the condition
>> >> >>> check.
>> >> >>>
>> >> >>> By searching on this board and the HBase documents, I tried to
>> >> >>> implement my own subclass of TableInputFormat and set a
>> >> >>> ColumnValueFilter in the configure method.
>> >> >>>
>> >> >>> public class TableInputFilterFormat extends TableInputFormat
>> >> >>>     implements JobConfigurable {
>> >> >>>   private final Log LOG =
>> >> >>>       LogFactory.getLog(TableInputFilterFormat.class);
>> >> >>>
>> >> >>>   public static final String FILTER_LIST =
>> >> >>>       "hbase.mapred.tablefilters";
>> >> >>>
>> >> >>>   public void configure(JobConf job) {
>> >> >>>     // The table name is passed in as the job's input path.
>> >> >>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
>> >> >>>
>> >> >>>     // Set the scanner columns from the COLUMN_LIST property.
>> >> >>>     String colArg = job.get(COLUMN_LIST);
>> >> >>>     String[] colNames = colArg.split(" ");
>> >> >>>     byte[][] m_cols = new byte[colNames.length][];
>> >> >>>     for (int i = 0; i < m_cols.length; i++) {
>> >> >>>       m_cols[i] = Bytes.toBytes(colNames[i]);
>> >> >>>     }
>> >> >>>     setInputColums(m_cols);
>> >> >>>
>> >> >>>     // Only return rows where f1:testcol equals "0".
>> >> >>>     ColumnValueFilter filter = new ColumnValueFilter(
>> >> >>>         Bytes.toBytes("f1:testcol"),
>> >> >>>         ColumnValueFilter.CompareOp.EQUAL,
>> >> >>>         Bytes.toBytes("0"));
>> >> >>>     setRowFilter(filter);
>> >> >>>
>> >> >>>     try {
>> >> >>>       setHTable(new HTable(new HBaseConfiguration(job),
>> >> >>>           tableNames[0].getName()));
>> >> >>>     } catch (Exception e) {
>> >> >>>       LOG.error(e);
>> >> >>>     }
>> >> >>>   }
>> >> >>> }
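>> >> >>>
>> >> >>> The job setup then just selects this input format, along the
>> >> >>> lines of (sketch; driver class and table name are
>> >> >>> placeholders):
>> >> >>>
>> >> >>>   JobConf job = new JobConf(MyDriver.class);
>> >> >>>   FileInputFormat.setInputPaths(job, new Path("mytable"));
>> >> >>>   job.set(TableInputFormat.COLUMN_LIST, "f1:testcol");
>> >> >>>   job.setInputFormat(TableInputFilterFormat.class);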
>> >> >>>
>> >> >>> However, the M/R job with the RowFilter is much slower than the
>> >> >>> M/R job without it. During the process many tasks failed with
>> >> >>> something like "Task attempt_200812091733_0063_m_000019_1
>> >> >>> failed to report status for 604 seconds. Killing!". I am
>> >> >>> wondering whether the RowFilter can really cut the records fed
>> >> >>> to the mappers from 1 billion down to 1 million? If it cannot,
>> >> >>> is there any other way to address this issue?
>> >> >>>
>> >> >>> I am using Hadoop 0.18.2 and HBase 0.18.1.
>> >> >>>
>> >> >>> Thank you so much in advance!
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
> 
> 


