Date: Thu, 11 Jan 2018 20:10:00 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: <JIRA.12986707.1467688924000.631360.1515701400082@Atlassian.JIRA>
In-Reply-To: <JIRA.12986707.1467688924000@Atlassian.JIRA>
References: <JIRA.12986707.1467688924000@Atlassian.JIRA> <JIRA.12986707.1467688924138@jira-lw-us.apache.org>
Subject: [jira] [Commented] (HADOOP-13340) Compress Hadoop Archive output
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/HADOOP-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322869#comment-16322869 ] 

Jason Lowe commented on HADOOP-13340:
-------------------------------------

Choosing which files to compress doesn't really solve the issues I brought up in my previous comment.  Even if we compress only some of the files, unless we choose a splittable/seekable codec and provide transparent decoding in the HarFileSystem layer, it could change the semantics of how an application accesses the data before and after it enters the .har archive (e.g., an app that was working just fine on uncompressed data may not gracefully handle the compressed data, especially if the codec isn't splittable).  That would be adding compression to the har that is not transparent.  I suppose as long as that's clearly documented and the user expects that behavior, it could be OK.
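
To make the non-splittable case concrete, here is a minimal standalone Python sketch (not Hadoop code). The `read_split` helper and the record layout are hypothetical; it only illustrates that a byte range taken from the middle of a gzip stream cannot be decoded on its own, whereas the same kind of range from uncompressed data is directly usable.

```python
import gzip
import zlib

def read_split(stored: bytes, start: int, end: int) -> bytes:
    # Hypothetical split reader: seeks to `start` and reads raw bytes,
    # as an application written for uncompressed input would.
    return stored[start:end]

# 1000 fixed-width records of 13 bytes each (illustrative data only).
records = b"".join(f"record-{i:05d}\n".encode() for i in range(1000))
compressed = gzip.compress(records)

# On uncompressed data, a split starting mid-file yields usable records.
half = len(records) // 2
plain_split = read_split(records, half, len(records))

# On gzip data, the same kind of byte range is not independently
# decodable: a gzip stream has to be decoded from its beginning.
mid = len(compressed) // 2
try:
    zlib.decompress(read_split(compressed, mid, len(compressed)), wbits=31)
    decodable = True
except zlib.error:
    decodable = False
```

An application that assumed it could start reading at an arbitrary offset would need either a splittable codec or a decoding layer in front of it to keep working.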

What needs to be clarified is the requirements and expectations of this feature.  Is the compression transparent (i.e., data appears exactly as it was to anyone accessing the .har archive, yet it is actually stored compressed and transparently decoded during access), or is each file simply (optionally) compressed as it is added to the archive?  The latter has a straightforward workaround today: compress the original files before archiving them.  The former would require support in HarFileSystem but could be nice for the common use case for .har archives, which is packing together a lot of relatively small files.  The compression could work across file boundaries, achieving a greater compression ratio than if each file were compressed separately, at the cost of needing to decode up to an entire codec block to access a file's contents.
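
The trade-off between the two options can be sketched in a few lines of standalone Python. This is only an illustration under stated assumptions, not a proposal for the HarFileSystem implementation: the file names, the zlib codec, the 64 KB block size, and the `(block, offset, length)` index layout are all made up for the example.

```python
import zlib

def make_small_files(count=200):
    # Hypothetical small, similar files (e.g. short log records).
    return {
        f"part-{i:04d}.log":
            (f"2018-01-11 20:10:00 INFO task={i} status=OK\n" * 5).encode()
        for i in range(count)
    }

def compress_per_file(files):
    # Non-transparent option: each file is compressed on its own.
    return {name: zlib.compress(data) for name, data in files.items()}

def compress_in_blocks(files, block_size=64 * 1024):
    # Transparent option: pack files back to back, compress each block
    # as a unit, and index each file as (block number, offset, length).
    blocks, index = [], {}
    current = bytearray()
    for name, data in files.items():
        if current and len(current) + len(data) > block_size:
            blocks.append(zlib.compress(bytes(current)))
            current = bytearray()
        index[name] = (len(blocks), len(current), len(data))
        current += data
    if current:
        blocks.append(zlib.compress(bytes(current)))
    return blocks, index

def read_file(blocks, index, name):
    # The caller gets the original bytes back, but the whole block
    # has to be decoded to reach one file's contents.
    block_no, offset, length = index[name]
    raw = zlib.decompress(blocks[block_no])
    return raw[offset:offset + length]
```

With many small, similar files, the block-compressed total should come out smaller than the per-file total because the codec can reuse redundancy across file boundaries. The price is visible in `read_file`: every access decodes a full block, so a larger block size improves the ratio but increases the read overhead.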


> Compress Hadoop Archive output
> ------------------------------
>
>                 Key: HADOOP-13340
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13340
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 2.5.0
>            Reporter: Duc Le Tu
>              Labels: features, performance
>
> Why can't the Hadoop Archive tool compress its output like other MapReduce jobs?
> I used options like -D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec, but it doesn't work. Did I do something wrong?
> If not, please add an option to compress the output of the Hadoop Archive tool. It is very necessary for data retention for everyone (small files problem and compressed data).

