From: Jonathan Gray
Date: Mon, 20 Jul 2009 12:11:09 -0700
To: hbase-user@hadoop.apache.org
Subject: Re: A question about MapReduce job extracts recent data from HBase/Bigtable
Message-ID: <4A64C14D.7040705@streamy.com>

The row key is (website,stamp), so the table is effectively GROUP BY website and then ORDER BY stamp.

If you want just the recent data, you'd do some kind of row filter server-side so that you are only returned clicks from the range you specified for that particular MR job.

Does that make sense? Do you think it's more complex than that?

They are grouping by the website, not strictly ordering by the stamp, so there's no way to prevent a full table scan (server-side), but you can use filters to prevent all the unnecessary data from moving back to the client/job.

JG

Schubert Zhang wrote:
> Hi all,
>
> I have a periodically scheduled MapReduce job that needs to extract recent
> data from an HBase table for analysis, and to avoid scanning/reading the
> already-analyzed data. Do you have any ideas?
>
> In the Google paper "Bigtable: A Distributed Storage System for Structured
> Data",
> Section: 8.1 Google Analytics
>
> The raw click table (~200 TB) maintains a row for each end-user session. The
> row name is a tuple containing the website's name and the time at which the
> session was created. This schema ensures that sessions that visit the same
> web site are contiguous, and that they are sorted chronologically. This
> table compresses to 14% of its original size.
>
> The summary table (~20 TB) contains various predefined summaries for each
> website. This table is generated from the raw click table by periodically
> scheduled MapReduce jobs. Each MapReduce job extracts recent session data
> from the raw click table. The overall system's throughput is limited by the
> throughput of GFS. This table compresses to 29% of its original size.
>
> Can anybody share ideas about how "Each MapReduce job extracts recent
> session data from the raw click table."?
>
> Thanks!
> Schubert
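
To make the point in the reply concrete, here is a minimal sketch of why a (website,stamp) row key leaves recent rows scattered across the key space, and what a server-side filter buys you. This is a toy in-memory model, not HBase client code: the function name `scan_with_filter`, the sample rows, and the integer stamps are all hypothetical and chosen only for illustration.

```python
# Toy model of a table whose rows are sorted by a (website, stamp) row key.
# Nothing here is an HBase API; names and data are made up for illustration.

def scan_with_filter(rows, since):
    """Visit every row in key order (a full scan) and return only rows
    whose stamp is >= since, mimicking a filter applied on the server."""
    scanned = 0
    returned = []
    for key, value in rows:
        scanned += 1                # the "server" still reads every row
        website, stamp = key
        if stamp >= since:          # only matching rows go back to the job
            returned.append((key, value))
    return scanned, returned

# Sorting by the tuple key groups by website first, then orders by stamp.
rows = sorted([
    (("a.com", 100), "click1"),
    (("a.com", 205), "click2"),
    (("b.com", 110), "click3"),
    (("b.com", 210), "click4"),
    (("c.com", 120), "click5"),
])

scanned, recent = scan_with_filter(rows, since=200)

# Positions of the recent rows in key order; they are not adjacent,
# so no single contiguous key range covers exactly the recent data.
positions = [i for i, (key, _) in enumerate(rows) if key[1] >= 200]
```

In this sketch the scan touches all five rows but returns only two, and the two recent rows sit at non-adjacent positions in key order. That is the trade-off described above: the filter does not reduce what the server reads, only what is shipped back to the client/job.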