Date: Sun, 6 Nov 2011 04:26:51 +0000 (UTC)
From: "jinglong.liujl (Created) (JIRA)"
To: hdfs-dev@hadoop.apache.org
Reply-To: hdfs-dev@hadoop.apache.org
Message-ID: <1805912074.4135.1320553611656.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Created] (HDFS-2542) Transparent compression storage in HDFS

Transparent compression storage in HDFS
---------------------------------------

Key: HDFS-2542
URL:
https://issues.apache.org/jira/browse/HDFS-2542
Project: Hadoop HDFS
Issue Type: Bug
Reporter: jinglong.liujl

Like HDFS-2115, this issue aims to improve storage usage in HDFS through compression. Unlike HDFS-2115, it focuses on compressing the data as stored. Some ideas are listed below.

To do:

1. Compress cold data.
   Cold data: data that nobody has touched for a long time after it was written (or last read).
   Hot data: data that many clients read after it is written, and that may be deleted soon.
   Because compressing hot data is not cost-effective, we compress only cold data. In some cases, part of a file is accessed frequently while other parts of the same file are cold. To distinguish them, we compress at the block level.

2. Compress data that has a high compression ratio.
   To tell a high ratio from a low one, we try compressing the data; if the compression ratio is too low, we never compress that data.

3. Forward compatibility.
   After compression, the data format on the datanode changes, and old clients cannot read it. To solve this, we provide a mechanism that decompresses on the datanode.

4. Support random access and append.
   As in HDFS-2115, random access can be supported with an index. Before compressing, we split the data into fixed-length pieces (we call each piece a "chunk"), and every chunk has its own index entry. For random access, we seek to the nearest index entry and read that chunk to reach the precise position.

5. Compress asynchronously, so that compression does not slow down running jobs.
   In practice, we found that cluster CPU usage is not uniform. Some clusters are idle at night, and others are idle in the afternoon. The compression task should run at full speed when the cluster is idle and at low speed when the cluster is busy.

Will do:

1. Client-specified codec and support for compressed transmission.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
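
For illustration only, here is a minimal Python sketch of the chunk index and trial-compression ideas from the issue description. It is not the proposed HDFS implementation: zlib stands in for whatever codec would be chosen, and the names CHUNK_SIZE, compress_block, read_at, worth_compressing and the min_saving threshold are all hypothetical.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk length, in uncompressed bytes


def compress_block(data, chunk_size=CHUNK_SIZE):
    """Compress data chunk by chunk and return (payload, index).

    index[i] is the byte offset in payload where compressed chunk i starts.
    """
    parts = []
    index = []
    offset = 0
    for start in range(0, len(data), chunk_size):
        compressed = zlib.compress(data[start:start + chunk_size])
        index.append(offset)
        parts.append(compressed)
        offset += len(compressed)
    return b"".join(parts), index


def read_at(payload, index, position, length, chunk_size=CHUNK_SIZE):
    """Read `length` uncompressed bytes starting at uncompressed `position`."""
    out = bytearray()
    chunk_no = position // chunk_size   # nearest index entry at or before position
    skip = position % chunk_size        # bytes to discard inside that chunk
    while len(out) < length and chunk_no < len(index):
        start = index[chunk_no]
        end = index[chunk_no + 1] if chunk_no + 1 < len(index) else len(payload)
        chunk = zlib.decompress(payload[start:end])
        out += chunk[skip:skip + (length - len(out))]
        skip = 0
        chunk_no += 1
    return bytes(out)


def worth_compressing(sample, min_saving=0.2):
    """Trial compression: True if the sample shrinks by at least min_saving."""
    if not sample:
        return False
    return len(zlib.compress(sample)) <= len(sample) * (1 - min_saving)
```

Only the chunks that overlap the requested range are decompressed, which is what makes a fixed chunk length convenient: the chunk number follows directly from the uncompressed position. Append could be handled by adding new chunks and index entries at the end, though the sketch does not cover it.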