From: "Erik Holstad"
To: hbase-user@hadoop.apache.org
Date: Thu, 18 Dec 2008 08:40:18 -0800
Subject: Re: How to read a subset of records based on a column value in a M/R job?

Hi Tigertail!
Not sure if I understand your original problem correctly, but it seemed to me
that you just want to get the rows that have the value 1 in a column, right?
Did you try putting that column in only for the rows that you want to get, and
then using that column as the input to the MR job? I haven't timed my MR jobs
with this approach, so I'm not sure how it is handled internally, but maybe it
is worth giving it a try; a rough sketch follows below.
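Something along these lines is what I have in mind. This is just an untested
sketch against the 0.18 mapred API used later in this thread; the marker
column "f1:wanted", the table name "mytable", and the class name are made up.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MarkerColumnSketch {

  // Step 1: tag only the rows you care about with a small marker column.
  static void markRow(HTable table, byte[] row) throws IOException {
    BatchUpdate update = new BatchUpdate(row);
    update.put("f1:wanted", Bytes.toBytes("1"));
    table.commit(update);
  }

  // Step 2: point the job at just that column, so the scanner only hands
  // the mapper rows that actually have it (mapper/reducer setup omitted).
  static void configureJob(JobConf job) {
    job.setInputFormat(TableInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("mytable"));
    job.set(TableInputFormat.COLUMN_LIST, "f1:wanted");
  }
}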
Regards Erik

On Wed, Dec 17, 2008 at 8:37 PM, tigertail wrote:
>
> Hi St. Ack,
>
> Thanks for your input. I ran 32 map tasks (I have 8 boxes with 4 CPUs each).
> Suppose the 1M row keys are known beforehand and saved in a file; I just
> read each key into a mapper and use table.getRow(key) to get the record. I
> also tried to increase the number of map tasks, but it did not improve the
> performance. Actually, it got even worse: many tasks failed or were killed
> with something like "no response in 600 seconds."
>
>
> stack-3 wrote:
> >
> > For A2. below, how many map tasks? How did you split the 1M you wanted
> > to fetch? How many of them ran concurrently?
> > St.Ack
> >
> >
> > tigertail wrote:
> >> Hi, can anybody help? Hopefully the following helps make my question
> >> clear if it was not in my last post.
> >>
> >> A1. I created a table in HBase and then inserted 10 million records
> >> into the table.
> >> A2. I ran a M/R program with a total of 10 million "get by rowkey"
> >> operations to read the 10M records out, and it took about 3 hours to
> >> finish.
> >> A3. I also ran a M/R program which used TableMap to read the 10M
> >> records out, and it took just 12 minutes.
> >>
> >> Now suppose I only need to read 1 million records whose row keys are
> >> known beforehand (and let's suppose the worst case, where the 1M
> >> records are evenly distributed among the 10M records).
> >>
> >> S1. I can use 1M "get by rowkey" operations. But it is slow.
> >> S2. I can also simply use TableMap and only output the 1M records in
> >> the map function, but it actually reads the whole table.
> >>
> >> Q1. Is there some more efficient way to read the 1M records, WITHOUT
> >> PASSING THROUGH THE WHOLE TABLE?
> >>
> >> How about if I have 1 billion records in an HBase table and I only
> >> need to read 1 million records in the following two scenarios:
> >>
> >> Q2. suppose their row keys are known beforehand
> >> Q3. or suppose these 1 million records have the same value in a column
> >>
> >> Any input would be greatly appreciated. Thank you so much!
> >>
> >>
> >> tigertail wrote:
> >>
> >>> For example, I have an HBase table with 1 billion records. Each
> >>> record has a column named 'f1:testcol', and I want to get only the
> >>> records with 'f1:testcol'=0 as the input to my map function. Suppose
> >>> there are 1 million such records; I would expect this to be much
> >>> faster than getting all 1 billion records into my map function and
> >>> then doing a condition check.
> >>>
> >>> By searching this board and the HBase documents, I tried to implement
> >>> my own subclass of TableInputFormat and set a ColumnValueFilter in
> >>> the configure method.
> >>>
> >>> public class TableInputFilterFormat extends TableInputFormat
> >>>     implements JobConfigurable {
> >>>   private final Log LOG =
> >>>       LogFactory.getLog(TableInputFilterFormat.class);
> >>>
> >>>   public static final String FILTER_LIST = "hbase.mapred.tablefilters";
> >>>
> >>>   public void configure(JobConf job) {
> >>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
> >>>
> >>>     String colArg = job.get(COLUMN_LIST);
> >>>     String[] colNames = colArg.split(" ");
> >>>     byte[][] m_cols = new byte[colNames.length][];
> >>>     for (int i = 0; i < m_cols.length; i++) {
> >>>       m_cols[i] = Bytes.toBytes(colNames[i]);
> >>>     }
> >>>     setInputColums(m_cols);
> >>>
> >>>     ColumnValueFilter filter = new ColumnValueFilter(
> >>>         Bytes.toBytes("f1:testcol"),
> >>>         ColumnValueFilter.CompareOp.EQUAL,
> >>>         Bytes.toBytes("0"));
> >>>     setRowFilter(filter);
> >>>
> >>>     try {
> >>>       setHTable(new HTable(new HBaseConfiguration(job),
> >>>           tableNames[0].getName()));
> >>>     } catch (Exception e) {
> >>>       LOG.error(e);
> >>>     }
> >>>   }
> >>> }
> >>>
> >>> However, the M/R job with the RowFilter is much slower than the M/R
> >>> job without the RowFilter. During the run, many tasks failed with
> >>> something like "Task attempt_200812091733_0063_m_000019_1 failed to
> >>> report status for 604 seconds. Killing!". I am wondering whether the
> >>> RowFilter can really decrease the records fed to the map function
> >>> from 1 billion to 1 million. If it cannot, is there any other method
> >>> to address this issue?
> >>>
> >>> I am using Hadoop 0.18.2 and HBase 0.18.1.
> >>>
> >>> Thank you so much in advance!
> >>
> >
>
> --
> View this message in context:
> http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p21066895.html
> Sent from the HBase User mailing list archive at Nabble.com.
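For reference, the "get by rowkey" approach described above (A2/S1, one
table.getRow() call per known key inside a mapper) would look roughly like the
untested sketch below against the 0.18-era APIs. The table name "mytable", the
class name, and the one-row-key-per-line input file are assumptions, not
details from the thread.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GetByRowKeySketch extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private HTable table;

  public void configure(JobConf job) {
    try {
      table = new HTable(new HBaseConfiguration(job), "mytable");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text rowKey,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // One random read per key; these per-key lookups are what make this
    // approach so much slower than a sequential TableMap scan.
    RowResult row = table.getRow(Bytes.toBytes(rowKey.toString()));
    if (row != null && !row.isEmpty()) {
      output.collect(rowKey, new Text(row.toString()));
    }
    // Report progress so slow regions do not trip the task timeout.
    reporter.progress();
  }
}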