Date: Mon, 10 Dec 2012 05:58:21 -0800
Subject: Filtering/Collection columns during Major Compaction
From: Varun Sharma
To: user@hbase.apache.org

Hi,

My understanding of major compaction is that it writes out a single store file by merging the memstore and the store files on disk, cleaning out delete tombstones and the puts that precede them, and dropping excess versions.

We want to limit the number of columns per row in HBase. Specifically, we want to limit them in lexicographically sorted order: keep only the, say, 100 smallest columns (in the lexicographic sense) and discard the rest.

One way to do this would be to clean out the extra columns in a daily MapReduce job. Another would be to clean them out during major compaction, which can also be run daily. From the code, I see that a major compaction essentially invokes a Scan over the region, so if that Scan were invoked with an appropriate filter (say, ColumnCountGetFilter), would that do the trick?

Thanks
Varun
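
To make the intended per-row behavior concrete, here is a minimal sketch in plain Java. It does not use any HBase API: the class `ColumnLimitSketch` and the method `keepSmallestColumns` are hypothetical names introduced only for illustration, and a row is modeled as a `TreeMap` from column qualifier to value. It assumes ASCII qualifiers, for which Java's `String` ordering matches lexicographic byte order.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is NOT HBase code.
class ColumnLimitSketch {

    // Returns a new map holding at most `limit` columns of the row,
    // chosen as the lexicographically smallest qualifiers.
    static TreeMap<String, String> keepSmallestColumns(TreeMap<String, String> row, int limit) {
        TreeMap<String, String> kept = new TreeMap<>();
        // TreeMap iterates in ascending key order, so the first `limit`
        // entries are the smallest qualifiers.
        for (Map.Entry<String, String> cell : row.entrySet()) {
            if (kept.size() >= limit) {
                break; // everything after this point would be discarded
            }
            kept.put(cell.getKey(), cell.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("c", "3");
        row.put("a", "1");
        row.put("b", "2");
        // With a limit of 2, only the two smallest qualifiers survive.
        System.out.println(keepSmallestColumns(row, 2).keySet()); // prints [a, b]
    }
}
```

Whether a filter applied during compaction would produce exactly this result is the open question above; the sketch only pins down the desired outcome for a single row.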