From: Leon Mergen
Date: Mon, 19 Jul 2010 14:56:51 +0200
Subject: libhdfs / gzip support
To: common-user@hadoop.apache.org

Hello,

We're using Hadoop in a C-oriented architecture: libhdfs for storing files and Hadoop Pipes for map/reduce jobs. Since the data we're storing benefits a lot from compression, we're currently investigating ways to enable it. Ideally we would perform block-level compression, so that each 64 MB HDFS block of data is compressed separately.

Hadoop Pipes seems to provide a way to change the RecordReader and RecordWriter to enable the GzipCodec; however, I did not find a good way to tell libhdfs to store files compressed.

Does anyone have experience with this, and/or ideas on how best to approach the problem? We're using Hadoop 0.20.2.

Regards,

Leon Mergen
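P.S. For reference, here is a minimal sketch of the client-side workaround we are considering: compressing with zlib's gzip wrapper before handing the bytes to hdfsWrite, so the resulting files are readable by GzipCodec on the Java side. The path, the "default" connect string, and the buffer sizes below are placeholders, and error handling is minimal.

/* Sketch: gzip-compress a buffer with zlib, then write the compressed
 * bytes to HDFS through libhdfs. */
#include <hdfs.h>
#include <zlib.h>
#include <fcntl.h>
#include <string.h>

static int write_gzipped(hdfsFS fs, const char *path,
                         const char *data, size_t len)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    /* windowBits = 15 + 16 asks zlib for a gzip header/trailer, which
     * is what GzipCodec expects. */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY, 0, 0, 0);
    if (!out) {
        deflateEnd(&strm);
        return -1;
    }

    unsigned char buf[65536];
    strm.next_in  = (Bytef *)data;
    strm.avail_in = (uInt)len;
    int ret;
    do {
        /* Drain the compressor into buf and push each chunk to HDFS. */
        strm.next_out  = buf;
        strm.avail_out = sizeof(buf);
        ret = deflate(&strm, Z_FINISH);
        tSize n = (tSize)(sizeof(buf) - strm.avail_out);
        if (n > 0 && hdfsWrite(fs, out, buf, n) != n) {
            deflateEnd(&strm);
            hdfsCloseFile(fs, out);
            return -1;
        }
    } while (ret != Z_STREAM_END);

    deflateEnd(&strm);
    return hdfsCloseFile(fs, out);
}

int main(void)
{
    /* "default" picks up fs.default.name from the Hadoop configuration
     * on the classpath; an explicit host/port would work as well. */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs)
        return 1;
    const char msg[] = "hello, compressed world\n";
    int rc = write_gzipped(fs, "/tmp/example.gz", msg, sizeof(msg) - 1);
    hdfsDisconnect(fs);
    return rc == 0 ? 0 : 1;
}

One caveat with this approach: gzip streams are not splittable, so each file written this way would be processed by a single mapper rather than one mapper per 64 MB block. On the Pipes side, our assumption is that setting mapred.output.compress=true and mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec in the job configuration is what enables GzipCodec for job output; corrections welcome.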