hadoop-hdfs-issues mailing list archives

From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2542) Transparent compression storage in HDFS
Date Wed, 09 Nov 2011 16:55:51 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147152#comment-13147152 ]

Robert Joseph Evans commented on HDFS-2542:
-------------------------------------------

To Jinglong:
I agree completely with you. I just wanted to be sure that any final solution is generic:
something that cleanly separates the classification of hot vs. cold vs. really cold data
from any extra processing that happens when data moves from one classification to another.
Access time is a great start, but I can imagine a lot of potential innovation and
experimentation in this area.  I can also see lots of different groups wanting to do something
when the classification changes.  Like you said: compress the data, possibly move it to a
different disk, possibly apply RAID to it.  Whatever we do, it should also be pluggable.
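
To make the separation concrete, here is a minimal sketch of what "pluggable" could look like.
It is not HDFS code: BlockInfo, AccessTimeClassifier, TierManager, the tier names, and the
thresholds are all hypothetical names invented for illustration. The only point is that the
classifier and the actions fired on a classification change are independent plug-in points.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Tuple


@dataclass
class BlockInfo:
    """Hypothetical per-block metadata; not an HDFS type."""
    block_id: str
    last_access: float  # seconds since epoch
    tier: str = "hot"


class Classifier(Protocol):
    """Plug-in point 1: decide which tier a block belongs to."""
    def classify(self, block: BlockInfo, now: float) -> str: ...


class AccessTimeClassifier:
    """One possible classifier: uses only idle time (thresholds are assumptions)."""

    def __init__(self, cold_after: float, very_cold_after: float) -> None:
        self.cold_after = cold_after
        self.very_cold_after = very_cold_after

    def classify(self, block: BlockInfo, now: float) -> str:
        idle = now - block.last_access
        if idle >= self.very_cold_after:
            return "very_cold"
        if idle >= self.cold_after:
            return "cold"
        return "hot"


# Plug-in point 2: what to do when a block changes tier
# (compress it, move it to another disk, apply RAID, ...).
Action = Callable[[BlockInfo], None]


class TierManager:
    """Runs the classifier and fires the actions registered for each transition."""

    def __init__(self, classifier: Classifier) -> None:
        self.classifier = classifier
        self.actions: Dict[Tuple[str, str], List[Action]] = {}

    def on_transition(self, old: str, new: str, action: Action) -> None:
        self.actions.setdefault((old, new), []).append(action)

    def scan(self, blocks: List[BlockInfo], now: float) -> None:
        for block in blocks:
            new_tier = self.classifier.classify(block, now)
            if new_tier == block.tier:
                continue  # no classification change, nothing to do
            old_tier, block.tier = block.tier, new_tier
            for action in self.actions.get((old_tier, new_tier), []):
                action(block)
```

With this shape, swapping access time for some other heuristic only replaces the classifier,
and a group that wants RAID instead of compression only registers a different action.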
                
> Transparent compression storage in HDFS
> ---------------------------------------
>
>                 Key: HDFS-2542
>                 URL: https://issues.apache.org/jira/browse/HDFS-2542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: jinglong.liujl
>
> As with HDFS-2115, we want to provide a mechanism to improve storage usage in HDFS through
> compression. Unlike HDFS-2115, this issue focuses on compressed storage. Some ideas are below:
> To do:
> 1. Compress cold data.
>    Cold data: after being written (or last read), the data has not been touched by anyone
>    for a long time.
>    Hot data: after being written, many clients will read it, and it may be deleted soon.
>
>    Because compressing hot data is not cost-effective, we only compress cold data.
>    In some cases, part of a file is accessed frequently while other data in the same file
>    is cold. To distinguish them, we compress at the block level.
> 2. Compress data that has a high compression ratio.
>    To tell a high compression ratio from a low one, we try compressing the data; if the
>    ratio is too low, we never compress it.
> 3. Forward compatibility.
>    After compression, the data format on the datanode changes, and old clients cannot read
>    it. To solve this, we provide a mechanism that decompresses on the datanode.
> 4. Support random access and append.
>    As in HDFS-2115, random access can be supported by an index. Before compression, we split
>    the data into fixed-length pieces (we call these "chunks"), and every chunk has its own
>    index entry. For random access, we can seek to the nearest index entry and read that
>    chunk to reach the precise position.
> 5. Compress asynchronously so that compression does not slow down running jobs.
>    In practice, we found that cluster CPU usage is not uniform. Some clusters are idle at
>    night, and others are idle in the afternoon. We should run the compression task at full
>    speed when the cluster is idle and at low speed when it is busy.
> Will do:
> 1. Client-specific codec and support for compressed transmission.
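
The trial-compression check and the chunk index from the quoted description can be sketched
roughly as follows. This is an illustration only, not the proposed datanode format: it assumes
zlib as the codec, an in-memory list of byte offsets as the index, and hypothetical names and
defaults (CHUNK_SIZE, worth_compressing, compress_block, read_range, min_ratio). The ratio here
is original size divided by compressed size, so higher means better compression.

```python
import zlib
from typing import List, Tuple

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk length, in uncompressed bytes


def worth_compressing(data: bytes, min_ratio: float = 1.2) -> bool:
    """Trial-compress and report whether original/compressed size reaches min_ratio."""
    if not data:
        return False
    return len(data) / len(zlib.compress(data)) >= min_ratio


def compress_block(data: bytes, chunk_size: int = CHUNK_SIZE) -> Tuple[bytes, List[int]]:
    """Compress each fixed-length chunk independently.

    index[i] is the offset of compressed chunk i in the output;
    the final entry is the total compressed length.
    """
    out = bytearray()
    index = [0]
    for start in range(0, len(data), chunk_size):
        out += zlib.compress(data[start:start + chunk_size])
        index.append(len(out))
    return bytes(out), index


def read_range(compressed: bytes, index: List[int], offset: int, length: int,
               chunk_size: int = CHUNK_SIZE) -> bytes:
    """Random read: decompress only the chunks that overlap [offset, offset + length)."""
    if length <= 0:
        return b""
    first = offset // chunk_size
    last = min((offset + length - 1) // chunk_size, len(index) - 2)
    parts = [zlib.decompress(compressed[index[i]:index[i + 1]])
             for i in range(first, last + 1)]
    joined = b"".join(parts)
    skip = offset - first * chunk_size  # position of `offset` inside the first chunk
    return joined[skip:skip + length]
```

Because every chunk is compressed on its own, a read at an arbitrary offset touches only the
chunks it overlaps, at the cost of a somewhat lower compression ratio than compressing the
whole block as one stream.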

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
