From: Brock Noland <brock@cloudera.com>
Date: Sun, 29 Jul 2012 12:41:12 -0500
Subject: Re: Understanding compression in hdfs
To: hdfs-user@hadoop.apache.org

Also note that HDFS already computes checksums, which I believe you can retrieve:

http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
http://hadoop.apache.org/common/docs/r1.0.3/hdfs_design.html#Data+Integrity

Brock

On Sun, Jul 29, 2012 at 12:35 PM, Yaron Gonen <yaron.gonen@gmail.com> wrote:
> Thanks!
> I'll dig into those classes to figure out my next step.
>
> Anyway, I just realized that block-level compression has nothing to do with
> HDFS blocks. An HDFS block can contain an unknown number of compressed
> blocks, which makes my efforts kind of worthless.
>
> Thanks again!
>
> On Sun, Jul 29, 2012 at 6:40 PM, Tim Broberg <Tim.Broberg@exar.com> wrote:
>> What if you wrote a CompressionOutputStream class that wraps around the
>> existing ones and outputs a hash per <n> bytes, and a CompressionInputStream
>> that checks them? ...and a Codec that wraps your compressors around
>> arbitrary existing codecs.
>>
>> Sounds like a bunch of work, and I'm not sure where you would store the
>> hashes, but it would get the data into your clutches the instant it's
>> available.
>>
>> - Tim.
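Tim's wrapping-stream idea could be sketched roughly as below. This is not Hadoop's `CompressionOutputStream` API; a plain `java.io.FilterOutputStream` stands in for it, and the class name `HashingOutputStream`, the chunk size, and the in-memory hash list are all illustrative assumptions. In a real codec wrapper you would still have to decide where the digests get stored, which, as Tim notes, is the open question:

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** Emits a SHA-1 digest for every chunkSize bytes written through it. */
public class HashingOutputStream extends FilterOutputStream {
    private final int chunkSize;
    private final MessageDigest digest;
    private final List<byte[]> hashes = new ArrayList<>();
    private int count = 0;

    public HashingOutputStream(OutputStream out, int chunkSize)
            throws NoSuchAlgorithmException {
        super(out);
        this.chunkSize = chunkSize;
        this.digest = MessageDigest.getInstance("SHA-1");
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);                    // pass the byte through unchanged
        digest.update((byte) b);
        if (++count == chunkSize) {
            hashes.add(digest.digest()); // digest() also resets for the next chunk
            count = 0;
        }
    }

    /** Hash any partial final chunk and return all digests; call at close time. */
    public List<byte[]> finish() {
        if (count > 0) {
            hashes.add(digest.digest());
            count = 0;
        }
        return hashes;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        HashingOutputStream hos = new HashingOutputStream(sink, 4);
        hos.write("abcdefgh".getBytes("US-ASCII")); // two full 4-byte chunks
        List<byte[]> hashes = hos.finish();
        System.out.println(hashes.size());          // one SHA-1 per chunk -> 2
        System.out.println(sink.size());            // data passed through -> 8
    }
}
```

The single-byte `write(int)` override is enough for a sketch because `FilterOutputStream`'s array writes funnel through it, though a production wrapper would override `write(byte[], int, int)` for performance.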
>>
>> On Jul 29, 2012, at 7:41 AM, "Yaron Gonen" <yaron.gonen@gmail.com> wrote:
>>
>> Hi,
>> I've created a SequenceFile.Writer with block-level compression.
>> I'd like to create a SHA-1 hash for each block written. How do I do that?
>> I didn't see any way to take control of the compression in order to
>> know when a block is over.
>>
>> Thanks,
>> Yaron
>>
>> ------------------------------
>> The information contained in this email is intended only for the personal
>> and confidential use of the recipient(s) named above. The information and
>> any attached documents contained in this message may be Exar confidential
>> and/or legally privileged. If you are not the intended recipient, you are
>> hereby notified that any review, use, dissemination or reproduction of this
>> message is strictly prohibited and may be unlawful. If you have received
>> this communication in error, please notify us immediately by return email
>> and delete the original message.

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
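On the read side, the checking stream Tim describes would re-hash each chunk as it arrives and compare against the stored digests. A minimal stand-alone sketch of that check, again using plain `java.io` streams in place of Hadoop's `CompressionInputStream` (the class name `ChunkVerifier` and its `verify` method are invented for illustration, and the digests are simply passed in rather than read from wherever a real codec would store them):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** Re-hashes each fixed-size chunk as it is read and compares against stored digests. */
public class ChunkVerifier {
    public static boolean verify(InputStream in, int chunkSize, List<byte[]> expected)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[chunkSize];
        for (byte[] want : expected) {
            int n = 0, r;
            // Read up to one chunk; the final chunk may be short.
            while (n < chunkSize && (r = in.read(buf, n, chunkSize - n)) != -1) n += r;
            md.update(buf, 0, n);        // digest() below resets md for the next chunk
            if (!MessageDigest.isEqual(md.digest(), want)) return false; // corruption
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "abcdefgh".getBytes("US-ASCII");
        // Pre-compute the per-chunk SHA-1s a writer would have stored.
        List<byte[]> hashes = new ArrayList<>();
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (int off = 0; off < data.length; off += 4) {
            md.update(data, off, Math.min(4, data.length - off));
            hashes.add(md.digest());
        }
        System.out.println(verify(new ByteArrayInputStream(data), 4, hashes));  // true

        byte[] corrupted = data.clone();
        corrupted[5] ^= 1;               // flip one bit in the second chunk
        System.out.println(verify(new ByteArrayInputStream(corrupted), 4, hashes)); // false
    }
}
```

Note this verifies the wrapper's own per-chunk hashes; as Brock points out, HDFS independently checksums stored blocks, and those checksums are exposed through `FileSystem.getFileChecksum`.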