hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hari Mankude (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2542) Transparent compression storage in HDFS
Date Thu, 10 Nov 2011 20:37:52 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147992#comment-13147992
] 

Hari Mankude commented on HDFS-2542:
------------------------------------

Dedup blocks would be stored in a hdfs filesystem with 3 replicas. In fact, if the deduped
block is a "hot" block with lots of references, replica count can be increased for those blocks
as a policy setting.
                
> Transparent compression storage in HDFS
> ---------------------------------------
>
>                 Key: HDFS-2542
>                 URL: https://issues.apache.org/jira/browse/HDFS-2542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: jinglong.liujl
>
> As HDFS-2115, we want to provide a mechanism to improve storage usage in hdfs by compression.
Different from HDFS-2115, this issue focus on compress storage. Some idea like below:
> To do:
> 1. compress cold data.
>    Cold data: After writing (or last read), data has not touched by anyone for a long
time.
>    Hot data: After writing, many client will read it , maybe it'll delele soon.
>    
>    Because hot data compression is not cost-effective,  we only compress cold data. 
>    In some cases, some data in file can be access in high frequency,  but in the same
file, some data may be cold data. 
> To distinguish them, we compress in block level.
> 2. compress data which has high compress ratio.
>    To specify high/low compress ratio, we should try to compress data, if compress ratio
is too low, we'll never compress them.
> 2. forward compatibility.
>     After compression, data format in datanode has changed. Old client will not access
them. To solve this issue, we provide a mechanism which decompress on datanode.
> 3. support random access and append.
>    As HDFS-2115, random access can be support by index. We separate data before compress
by fixed-length (we call these fixed-length data as "chunk"), every chunk has its index.
> When random access, we can seek to the nearest index, and read this chunk for precise
position.   
> 4. async compress to avoid compression slow down running job.
>    In practice, we found the cluster CPU usage is not uniform. Some clusters are idle
at night, and others are idle at afternoon. We should make compress 
> task running in full speed when cluster idle, and in low speed when cluster busy.
> Will do:
> 1. client specific codec and support  compress transmission.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message