From: Dave Latham
Date: Sat, 1 Aug 2015 20:06:00 -0700
Subject: Re: scan column families with different time ranges
To: hbase-user

Thanks for brainstorming, Ted. That sounds like option 2 I listed, using
a separate scanner for A vs B, which "adds complexity to the job and
gives up the atomicity/consistency guarantees as new writes hit both
column families".

On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu wrote:

> Can you achieve your goal with two scans?
> The first scan specifies a TimeRange corresponding to the last day.
> This scan returns both column families.
> The other scan specifies a TimeRange excluding the last day. This scan
> returns column family A.
>
> Cheers
>
> On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham wrote:
>
> > Hi Ted,
> >
> > Thanks for the suggestion, but I'm not sure that it helps my case
> > much. I wasn't very familiar with the feature, and it doesn't seem
> > very well documented - I had to go to the source and the originating
> > JIRA to understand how it works. It sounds like it allows you to
> > mark which column families the filter operates on ("essential" seems
> > an odd name). If any data from those column families passes the
> > filter, then the scan loads and includes data from the remaining
> > families without filtering it. In my case, it's not clear from a
> > row's family A whether or not family B for that row is required
> > (though that could probably be added). Moreover, even if a row has
> > recent data, we don't want to load all the old data from that row.
> > We'd prefer to be able to entirely skip reading the data off disk
> > for the old store files.
> >
> > Dave
> >
> > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu wrote:
> >
> > > Have you considered using the essential column family feature
> > > (through Filter)?
> > > In your case A would be the essential column family.
> > > Within the TimeRange for recent data, the filter would return both
> > > column families.
> > > Outside the TimeRange, only family A is returned.
> > >
> > > Cheers
> > >
> > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham wrote:
> > >
> > > > I have a table with 2 column families, call them A and B, with
> > > > new data regularly being added. They are very different sizes:
> > > > B is 100x the size of A. Among other uses for this data, I have
> > > > a MapReduce job that needs to read all of A, but only recent
> > > > data from B (e.g. last day). Here are some methods I've
> > > > considered:
> > > >
> > > > 1. Use a Filter to throw out older data from B (this is what I
> > > >    currently do). However, all the data from B still needs to be
> > > >    read from disk, causing a disk IO bottleneck.
> > > > 2. Configure the table input format to read from B only, using a
> > > >    TimeRange for recent data, and have each map task open a
> > > >    separate scanner for A (without a TimeRange), then merge the
> > > >    data in the map task. However, this adds complexity to the
> > > >    job and gives up the atomicity/consistency guarantees as new
> > > >    writes hit both column families.
> > > > 3. Add a new column family C to the table with an additional
> > > >    copy of the data in B, but set a TTL on it. All writes
> > > >    duplicate the data written to B and C. Change the scan to
> > > >    include C instead of B. However, this adds all the overhead
> > > >    of another column family, more writes, and having to set the
> > > >    TTL to the maximum of any time window I want to scan
> > > >    efficiently.
> > > > 4. Implement an enhancement to HBase's Scan to allow giving each
> > > >    column family its own TimeRange. The job would then be able
> > > >    to skip most old large store files (hopefully all of them
> > > >    with tiered compaction at some point).
> > > >
> > > > Does anyone have other suggestions? Would HBase be willing to
> > > > accept updating Scan to have different TimeRanges for each
> > > > column family?
> > > >
> > > > Dave
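
To make the difference between option 1 and option 4 concrete, here is a
rough toy model in Python. It is not HBase code and uses no HBase API:
the StoreFile class, the files_to_read function, and the sizes and day
numbers are all made up for illustration. It assumes each store file
knows the oldest and newest timestamp it contains, so a file whose span
falls entirely outside the requested range for its family can be skipped
without being read from disk.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Half-open time range [min_ts, max_ts), in arbitrary units (days here).
TimeRange = Tuple[int, int]


@dataclass(frozen=True)
class StoreFile:
    """Hypothetical stand-in for one store file of one column family."""
    family: str
    min_ts: int   # oldest timestamp in the file
    max_ts: int   # newest timestamp in the file (inclusive)
    size_mb: int


def overlaps(f: StoreFile, tr: TimeRange) -> bool:
    """True if the file may hold cells inside the half-open range tr."""
    lo, hi = tr
    return f.max_ts >= lo and f.min_ts < hi


def files_to_read(
    files: List[StoreFile],
    default_range: TimeRange,
    family_ranges: Optional[Dict[str, TimeRange]] = None,
) -> List[StoreFile]:
    """Select files a scan must open.

    family_ranges overrides default_range per column family, which is
    the idea behind option 4. With no overrides, every family uses the
    same range, as in a scan with a single TimeRange.
    """
    family_ranges = family_ranges or {}
    return [
        f for f in files
        if overlaps(f, family_ranges.get(f.family, default_range))
    ]


if __name__ == "__main__":
    files = [
        StoreFile("A", 0, 8, 10),
        StoreFile("A", 9, 10, 1),
        StoreFile("B", 0, 8, 1000),   # old, large file we want to skip
        StoreFile("B", 9, 10, 100),
    ]
    everything = (0, 2**63 - 1)
    last_day = (9, 11)

    # Option 1: one range for all families, filter B afterwards.
    full = files_to_read(files, everything)
    # Option 4: all of A, but only the last day of B.
    narrowed = files_to_read(files, everything, {"B": last_day})

    print(sum(f.size_mb for f in full), "MB read with a single range")
    print(sum(f.size_mb for f in narrowed), "MB read with a per-family range")
```

In this toy data the per-family range lets the scan leave the old B file
untouched while still reading both A files, which is the disk IO saving
option 4 is after. How much real HBase would save depends on how well
compaction keeps old and new data in separate store files.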