From: Harsh J
Date: Thu, 5 Dec 2013 15:52:20 +0530
Subject: Re: Check compression codec of an HDFS file

If you're looking for inspection based on the file header/contents, you
could download the file and run the Linux utility 'file' on it, and it
should tell you the format. I don't know about Snappy (AFAIK, we don't
have Snappy frame/container format support in Hadoop yet, although
upstream Snappy issue 34 seems resolved now), but Gzip files can be
identified simply by the magic sequence in their header bytes
(0x1f 0x8b).

If it's sequence files you are looking to analyse, a simple way is to
read the first few hundred bytes of the file, which should have the
codec class name string in them.

Programmatically, you can use
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec()
for sequence files.

On Thu, Dec 5, 2013 at 5:10 AM, alex bohr wrote:
> What's the best way to check the compression codec that an HDFS file was
> written with?
>
> We use both Gzip and Snappy compression so I want a way to determine how a
> specific file is compressed.
>
> The closest I found is the getCodec but that relies on the file name suffix
> ... which don't exist since Reducers typically don't add a suffix to the
> filenames they create.
>
> Thanks

--
Harsh J
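
As a rough illustration of the header-based checks described in this reply, here is a minimal Python sketch. It assumes you have already copied the first few hundred bytes of the file to the local machine. The function names `sniff_compression` and `sniff_file` are made up for this example, and the regular expression only recognises codecs in the `org.apache.hadoop.io.compress` package, so third-party codecs would need their own pattern. It does not attempt to detect raw Snappy output.

```python
import re

# Gzip streams start with the two magic bytes 0x1f 0x8b.
GZIP_MAGIC = b"\x1f\x8b"
# Hadoop SequenceFiles start with the ASCII bytes "SEQ" plus a version byte.
SEQ_MAGIC = b"SEQ"
# Assumption: the codec is one of Hadoop's built-in codec classes, whose
# fully qualified name appears as plain text in the SequenceFile header.
CODEC_PATTERN = re.compile(rb"org\.apache\.hadoop\.io\.compress\.\w+")


def sniff_compression(header: bytes) -> str:
    """Guess the format from the first few hundred bytes of a file."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(SEQ_MAGIC):
        match = CODEC_PATTERN.search(header)
        if match:
            return "sequencefile:" + match.group(0).decode("ascii")
        # No codec string found: uncompressed, or a codec outside the pattern.
        return "sequencefile:no codec string found"
    return "unknown"


def sniff_file(path: str, num_bytes: int = 512) -> str:
    """Read the start of a local file and classify it."""
    with open(path, "rb") as handle:
        return sniff_compression(handle.read(num_bytes))
```

One way to get the header bytes without downloading the whole file is to pipe `hadoop fs -cat <path>` through `head -c 512` into a local file and pass that file to `sniff_file`. For anything beyond a quick check, `SequenceFile.Reader.getCompressionCodec()` is the more reliable route for sequence files.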