hadoop-mapreduce-issues mailing list archives

From "Rodrigo Schmidt (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1491) Use HAR filesystem to merge parity files
Date Sun, 14 Feb 2010 09:39:27 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833542#action_12833542 ]

Rodrigo Schmidt commented on MAPREDUCE-1491:

Dhruba, thanks for reviewing the code.

As for your question: with the current code, the .har files are never deleted automatically.
In the scenario you presented, when you delete one of the files, the har file is left as it
is, with all 10 parity files inside. I'm doing that precisely to avoid reducing the redundancy
of the other files.

Also, if you recreate one of the files, a new parity file is generated outside the har,
and the code on the RaidNode is smart enough to pick the parity file outside the har.
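
To illustrate the lookup order described above, here is a minimal Python sketch. It is not
the RaidNode code (which is Java). The function name locate_parity, the two dictionaries,
and the example paths are hypothetical; the only behavior taken from the patch is that a
parity file outside the har is preferred over one inside it.

```python
from typing import Dict, Optional, Tuple


def locate_parity(
    src_path: str,
    standalone_parity: Dict[str, str],
    har_parity: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (location, parity_path) for src_path, or None if no parity exists.

    Both dictionaries map a source file path to its parity file path.
    They are stand-ins (assumptions) for whatever directory listing the
    real RaidNode performs.
    """
    # Prefer the parity file outside the har: it was generated after the
    # source file was recreated.
    if src_path in standalone_parity:
        return ("standalone", standalone_parity[src_path])
    # Otherwise fall back to the parity file stored inside the har.
    if src_path in har_parity:
        return ("har", har_parity[src_path])
    return None


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    outside = {"/user/a": "/raid/user/a"}
    inside = {"/user/a": "har:///raid/user/parity.har/a",
              "/user/b": "har:///raid/user/parity.har/b"}
    print(locate_parity("/user/a", outside, inside))
    print(locate_parity("/user/b", outside, inside))
```

With this ordering, a recreated file never has to wait for the har to be rebuilt before its
new parity data is used.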

The downside of the current patch is that even if all files are deleted or recreated, the
har file is never deleted and new parity files are created outside it. In the future I plan
to fix that and enable the recreation of har files when they become obsolete. I didn't do
that now to keep the code simple enough to be reviewed and deployed quickly.

Finally, the main idea behind using har on raid is to apply it to files that probably won't
change in the future (otherwise recreating things becomes too expensive). The code uses a
raid property called time_before_har (on each policy) to decide when the files are old enough
to be archived into a har. Setting this property properly will avoid wasting space in most
practical cases.
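
As a rough illustration of the age check, here is a minimal Python sketch. It assumes that
time_before_har is expressed in milliseconds and that age is measured from a file's
modification time; both are assumptions, as is the function name old_enough_to_har. The
actual patch may compute this differently.

```python
def old_enough_to_har(mtime_ms: int, now_ms: int, time_before_har_ms: int) -> bool:
    """Return True if a file last modified at mtime_ms may be put into a har.

    time_before_har_ms mirrors the per-policy time_before_har property;
    treating it as milliseconds is an assumption of this sketch.
    """
    age_ms = now_ms - mtime_ms
    return age_ms >= time_before_har_ms


if __name__ == "__main__":
    day_ms = 24 * 60 * 60 * 1000
    now = 100 * day_ms
    # A file untouched for 40 days, with a 30-day threshold, qualifies.
    print(old_enough_to_har(now - 40 * day_ms, now, 30 * day_ms))
    # A file modified 5 days ago does not.
    print(old_enough_to_har(now - 5 * day_ms, now, 30 * day_ms))
```

A larger threshold means fewer files are archived early and later recreated, at the cost of
keeping more individual parity files around for longer.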

Let me know what you think of this.

> Use HAR filesystem to merge parity files 
> -----------------------------------------
>                 Key: MAPREDUCE-1491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1491
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/raid
>            Reporter: Rodrigo Schmidt
>            Assignee: Rodrigo Schmidt
>         Attachments: MAPREDUCE-1491.0.patch
> The HDFS raid implementation (HDFS-503) creates a parity file for every file that is
> RAIDed. This puts additional burden on the memory requirements of the namenode. It will be
> nice if the parity files are combined together using the HadoopArchive (har) format.
> This was (HDFS-684) before, but raid migrated to MAPREDUCE.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
