From: Jean-Marc Spaggiari
To: user
Reply-To: user@hbase.apache.org
Date: Sun, 2 Aug 2015 09:22:02 -0400
Subject: Re: scan column families with different time ranges
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1133d9404e9c38051c53f013

--001a1133d9404e9c38051c53f013
Content-Type: text/plain; charset=UTF-8

Just thinking out loud:

"Cutting out the old store files could well also reduce disk IO for that
family by 100x."

What is "recent" for your data? More than 7 days? Or less? Don't you have
weekly major compactions? If so, and if you are scanning more than 7 days
back, then you will read the older files anyway, no?

JM

On 2015-08-02 05:57, "Ted Yu" wrote:

> Dave:
> I wonder if the Filter response can be enhanced in the following manner:
>
> http://pastebin.com/sb6apTPm
>
> My approach is based on using the essential column family (column family
> A in your case) to guide whether the remaining column families should be
> loaded.
> To be specific, if outside the TimeRange you specify (last day), your
> filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW.
>
> What do you think?
>
> Cheers
>
> On Sat, Aug 1, 2015 at 8:06 PM, Dave Latham wrote:
>
> > Thanks for brainstorming, Ted. That sounds like option 2 I listed using
> > a separate scanner for A vs B which "adds complexity to the job and
> > gives up the atomicity/consistency guarantees as new writes hit both
> > column families".
> >
> > On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu wrote:
> >
> > > Can you achieve your goal with two scans?
> > > The first scan specifies TimeRange corresponding to last day. This
> > > scan returns both column families.
> > > The other scan specifies TimeRange excluding last day. This scan
> > > returns column family A.
> > >
> > > Cheers
> > >
> > > On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > Thanks for the suggestion, but I'm not sure that it helps my case
> > > > much. I wasn't very familiar with the feature, and it doesn't seem
> > > > very well documented - I had to go to the source and the
> > > > originating JIRA to understand how it works. It sounds like it
> > > > allows you to mark which column families the filter operates on
> > > > ("essential" seems an odd name). If any data from those column
> > > > families passes the filter, then the scan loads and includes data
> > > > from the remaining families without filtering it. In my case, it's
> > > > not clear from a row's family A whether or not family B for that
> > > > row is required (though that could probably be added). Moreover,
> > > > even if a row has recent data, we don't want to load all the old
> > > > data from that row. We'd prefer to be able to entirely skip reading
> > > > the data off disk for the old store files.
> > > >
> > > > Dave
> > > >
> > > > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu wrote:
> > > >
> > > > > Have you considered using essential column family feature
> > > > > (through Filter)?
> > > > > In your case A would be the essential column family.
> > > > > Within the TimeRange for recent data, the filter would return
> > > > > both column families.
> > > > > Outside the TimeRange, only family A is returned.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham wrote:
> > > > >
> > > > > > I have a table with 2 column families, call them A and B, with
> > > > > > new data regularly being added. They are very different sizes:
> > > > > > B is 100x the size of A. Among other uses for this data, I have
> > > > > > a MapReduce job that needs to read all of A, but only recent
> > > > > > data from B (e.g. last day). Here are some methods I've
> > > > > > considered:
> > > > > >
> > > > > >    1. Use a Filter to throw out older data from B (this is what
> > > > > >    I currently do). However, all the data from B still needs to
> > > > > >    be read from disk, causing a disk IO bottleneck.
> > > > > >    2. Configure the table input format to read from B only,
> > > > > >    using a TimeRange for recent data, and have each map task
> > > > > >    open a separate scanner for A (without a TimeRange), then
> > > > > >    merge the data in the map task. However, this adds
> > > > > >    complexity to the job and gives up the atomicity/consistency
> > > > > >    guarantees as new writes hit both column families.
> > > > > >    3. Add a new column family C to the table with an additional
> > > > > >    copy of the data in B, but set a TTL on it. All writes
> > > > > >    duplicate the data written to B and C. Change the scan to
> > > > > >    include C instead of B. However, this adds all the overhead
> > > > > >    of another column family, more writes, and having to set the
> > > > > >    TTL to the maximum of any time window I want to scan
> > > > > >    efficiently.
> > > > > >    4. Implement an enhancement to HBase's Scan to allow giving
> > > > > >    each column family its own TimeRange. The job would then be
> > > > > >    able to skip most old large store files (hopefully all of
> > > > > >    them with tiered compaction at some point).
> > > > > >
> > > > > > Does anyone have other suggestions? Would HBase be willing to
> > > > > > accept updating Scan to have a different TimeRange for each
> > > > > > column family?
> > > > > >
> > > > > > Dave

--001a1133d9404e9c38051c53f013--
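
Option 4 in the original message depends on the scan being able to skip any store file whose timestamps fall entirely outside the TimeRange requested for that file's column family. The selection step can be sketched roughly as below. This is a toy model in plain Java, not HBase code: the names PerFamilyTimeRangeSketch, TimeRange, StoreFile and filesToRead are hypothetical, and the sketch assumes each store file is summarised by the oldest and newest cell timestamp it contains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are HBase classes. */
final class PerFamilyTimeRangeSketch {

    /** Half-open timestamp interval [min, max), in milliseconds. */
    static final class TimeRange {
        final long min;
        final long max;

        TimeRange(long min, long max) {
            if (min > max) {
                throw new IllegalArgumentException("min must not exceed max");
            }
            this.min = min;
            this.max = max;
        }

        /** True when the inclusive span [fileMin, fileMax] shares a timestamp with this range. */
        boolean overlaps(long fileMin, long fileMax) {
            return fileMin < max && fileMax >= min;
        }
    }

    /** A store file reduced to its family and the oldest/newest cell timestamps (assumed known). */
    static final class StoreFile {
        final String family;
        final long minTs;
        final long maxTs;

        StoreFile(String family, long minTs, long maxTs) {
            this.family = family;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /**
     * Returns the files a scan would still have to open. A family with no
     * entry in perFamily is read in full; a family with an entry only
     * contributes the files whose timestamp span overlaps its range.
     */
    static List<StoreFile> filesToRead(List<StoreFile> files, Map<String, TimeRange> perFamily) {
        List<StoreFile> selected = new ArrayList<>();
        for (StoreFile f : files) {
            TimeRange range = perFamily.get(f.family);
            if (range == null || range.overlaps(f.minTs, f.maxTs)) {
                selected.add(f);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long day = 86_400_000L;
        long now = 100 * day; // arbitrary "current time" for the illustration

        List<StoreFile> files = new ArrayList<>();
        files.add(new StoreFile("A", 0L, now));        // small family, read in full
        files.add(new StoreFile("B", 0L, 90 * day));   // old, large file
        files.add(new StoreFile("B", 98 * day, now));  // file holding recent data

        // Only family B is restricted, to roughly the last day.
        Map<String, TimeRange> ranges = new HashMap<>();
        ranges.put("B", new TimeRange(now - day, now + 1));

        for (StoreFile f : filesToRead(files, ranges)) {
            System.out.println(f.family + " [" + f.minTs + ", " + f.maxTs + "]");
        }
    }
}
```

In this model the old file for B is never opened, which is the disk IO saving option 4 is after. A file that mixes old and recent cells is still selected in full, so the benefit depends on compaction keeping old and new data in separate files, as the original message notes.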