From: "Erik Holstad"
To: hbase-user@hadoop.apache.org
Date: Thu, 18 Dec 2008 08:40:18 -0800
Subject: Re: How to read a subset of records based on a column value in a M/R job?

Hi Tigertail!
Not sure if I understand your original problem correctly, but it seemed to me
that you just want to get the rows that have the value 1 in a column, right?
Did you try putting that column in only for the rows that you want to get, and
then using that column as the input to the MR job? I haven't timed my MR jobs
with this approach, so I'm not sure how it is handled internally, but maybe it
is worth giving it a try; a rough sketch follows below.
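Something along these lines is what I have in mind. This is just an untested
sketch against the 0.18 mapred API used later in this thread; the marker
column "f1:wanted", the table name "mytable", and the class name are made up.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MarkerColumnSketch {

  // Step 1: tag only the rows you care about with a small marker column.
  static void markRow(HTable table, byte[] row) throws IOException {
    BatchUpdate update = new BatchUpdate(row);
    update.put("f1:wanted", Bytes.toBytes("1"));
    table.commit(update);
  }

  // Step 2: point the job at just that column, so the scanner only hands
  // the mapper rows that actually have it (mapper/reducer setup omitted).
  static void configureJob(JobConf job) {
    job.setInputFormat(TableInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("mytable"));
    job.set(TableInputFormat.COLUMN_LIST, "f1:wanted");
  }
}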
Regards Erik

On Wed, Dec 17, 2008 at 8:37 PM, tigertail wrote:
>
> Hi St. Ack,
>
> Thanks for your input. I ran 32 map tasks (I have 8 boxes with 4 CPUs each).
> Suppose the 1M row keys are known beforehand and saved in a file; I just
> read each key into a mapper and use table.getRow(key) to get the record. I
> also tried to increase the number of map tasks, but it did not improve the
> performance. Actually, it got even worse: many tasks failed or were killed
> with something like "no response in 600 seconds."
>
>
> stack-3 wrote:
> >
> > For A2. below, how many map tasks? How did you split the 1M you wanted
> > to fetch? How many of them ran concurrently?
> > St.Ack
> >
> >
> > tigertail wrote:
> >> Hi, can anybody help? Hopefully the following helps make my question
> >> clear if it was not in my last post.
> >>
> >> A1. I created a table in HBase and then inserted 10 million records
> >> into the table.
> >> A2. I ran a M/R program with a total of 10 million "get by rowkey"
> >> operations to read the 10M records out, and it took about 3 hours to
> >> finish.
> >> A3. I also ran a M/R program which used TableMap to read the 10M
> >> records out, and it took just 12 minutes.
> >>
> >> Now suppose I only need to read 1 million records whose row keys are
> >> known beforehand (and let's suppose the worst case, where the 1M
> >> records are evenly distributed among the 10M records).
> >>
> >> S1. I can use 1M "get by rowkey" operations. But it is slow.
> >> S2. I can also simply use TableMap and only output the 1M records in
> >> the map function, but it actually reads the whole table.
> >>
> >> Q1. Is there some more efficient way to read the 1M records, WITHOUT
> >> PASSING THROUGH THE WHOLE TABLE?
> >>
> >> How about if I have 1 billion records in an HBase table and I only
> >> need to read 1 million records in the following two scenarios:
> >>
> >> Q2. suppose their row keys are known beforehand
> >> Q3. or suppose these 1 million records have the same value in a column
> >>
> >> Any input would be greatly appreciated. Thank you so much!
> >>
> >>
> >> tigertail wrote:
> >>
> >>> For example, I have an HBase table with 1 billion records. Each
> >>> record has a column named 'f1:testcol', and I want to get only the
> >>> records with 'f1:testcol'=0 as the input to my map function. Suppose
> >>> there are 1 million such records; I would expect this to be much
> >>> faster than getting all 1 billion records into my map function and
> >>> then doing a condition check.
> >>>
> >>> By searching this board and the HBase documents, I tried to implement
> >>> my own subclass of TableInputFormat and set a ColumnValueFilter in
> >>> the configure method.
> >>>
> >>> public class TableInputFilterFormat extends TableInputFormat
> >>>     implements JobConfigurable {
> >>>   private final Log LOG =
> >>>       LogFactory.getLog(TableInputFilterFormat.class);
> >>>
> >>>   public static final String FILTER_LIST = "hbase.mapred.tablefilters";
> >>>
> >>>   public void configure(JobConf job) {
> >>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
> >>>
> >>>     String colArg = job.get(COLUMN_LIST);
> >>>     String[] colNames = colArg.split(" ");
> >>>     byte[][] m_cols = new byte[colNames.length][];
> >>>     for (int i = 0; i < m_cols.length; i++) {
> >>>       m_cols[i] = Bytes.toBytes(colNames[i]);
> >>>     }
> >>>     setInputColums(m_cols);
> >>>
> >>>     ColumnValueFilter filter = new ColumnValueFilter(
> >>>         Bytes.toBytes("f1:testcol"),
> >>>         ColumnValueFilter.CompareOp.EQUAL,
> >>>         Bytes.toBytes("0"));
> >>>     setRowFilter(filter);
> >>>
> >>>     try {
> >>>       setHTable(new HTable(new HBaseConfiguration(job),
> >>>           tableNames[0].getName()));
> >>>     } catch (Exception e) {
> >>>       LOG.error(e);
> >>>     }
> >>>   }
> >>> }
> >>>
> >>> However, the M/R job with the RowFilter is much slower than the M/R
> >>> job without the RowFilter. During the run, many tasks failed with
> >>> something like "Task attempt_200812091733_0063_m_000019_1 failed to
> >>> report status for 604 seconds. Killing!". I am wondering whether the
> >>> RowFilter can really decrease the records fed to the map function
> >>> from 1 billion to 1 million. If it cannot, is there any other method
> >>> to address this issue?
> >>>
> >>> I am using Hadoop 0.18.2 and HBase 0.18.1.
> >>>
> >>> Thank you so much in advance!
> >>
> >
>
> --
> View this message in context:
> http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p21066895.html
> Sent from the HBase User mailing list archive at Nabble.com.
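For reference, the "get by rowkey" approach described above (A2/S1, one
table.getRow() call per known key inside a mapper) would look roughly like the
untested sketch below against the 0.18-era APIs. The table name "mytable", the
class name, and the one-row-key-per-line input file are assumptions, not
details from the thread.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GetByRowKeySketch extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private HTable table;

  public void configure(JobConf job) {
    try {
      table = new HTable(new HBaseConfiguration(job), "mytable");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text rowKey,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // One random read per key; these per-key lookups are what make this
    // approach so much slower than a sequential TableMap scan.
    RowResult row = table.getRow(Bytes.toBytes(rowKey.toString()));
    if (row != null && !row.isEmpty()) {
      output.collect(rowKey, new Text(row.toString()));
    }
    // Report progress so slow regions do not trip the task timeout.
    reporter.progress();
  }
}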