hadoop-common-dev mailing list archives

From "Mahadev konar (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-3307) Archives in Hadoop.
Date Thu, 24 Apr 2008 20:37:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592164#action_12592164 ]

mahadev edited comment on HADOOP-3307 at 4/24/08 1:36 PM:
----------------------------------------------------------------

Here is the design for the archives. 

Archiving files in HDFS

- *Motivation* 

The Namenode is a limited resource, and we usually end up with lots of small files that users
rarely access. We would like to create an archiving utility that packs these files into
archives that are semi-transparent and usable by map reduce.

- Why not just concatenate the files?
 Concatenating files might be useful, but it is not a full-fledged solution for archiving
files. Users want to keep their files distinct and would sometimes like to unarchive them
without losing the file layout.

-  *Requirements*
 - Transparent or semi-transparent usage of archives.
 - Must be able to archive and unarchive in parallel.
 - Changeable archives are not a requirement, but the design should not prevent them from
being implemented later.
 - Compression is not a goal.

-  *Archive Format*
- Conventional archive formats like tar are not convenient for parallel archive creation.
- Here is a proposal that will allow archive creation in parallel.

The format of an archive as a filesystem path is: 

/user/mahadev/foo.har/_index*
/user/mahadev/foo.har/part-* 

The indexes store the filenames and their offsets within the part files.
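
To illustrate the idea of an index that maps filenames to offsets in part files, here is a
minimal in-memory sketch. It is not the actual on-disk format, which this design does not
specify. The names IndexEntry, build_archive and read_archived_file are hypothetical, and
the stored length field is an assumption (the design mentions only filenames and offsets).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical index record: where an archived file lives."""
    part_file: str  # name of the part file holding the bytes, e.g. "part-0"
    offset: int     # byte offset of the file within that part file
    length: int     # assumption: a length is stored so reads know where to stop


def build_archive(
    files: Dict[str, bytes], part_name: str = "part-0"
) -> Tuple[Dict[str, IndexEntry], Dict[str, bytes]]:
    """Pack all files into a single part and record each file's offset."""
    index: Dict[str, IndexEntry] = {}
    buf = bytearray()
    for name, content in files.items():
        index[name] = IndexEntry(part_name, len(buf), len(content))
        buf.extend(content)
    return index, {part_name: bytes(buf)}


def read_archived_file(
    index: Dict[str, IndexEntry], parts: Dict[str, bytes], name: str
) -> bytes:
    """Look the file up in the index, then slice it out of its part file."""
    entry = index[name]
    data = parts[entry.part_file]
    return data[entry.offset:entry.offset + entry.length]
```

Because each file keeps its own index entry, the original names and layout survive
archiving, which plain concatenation would not give us.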

-  *URI Syntax*
The Har FileSystem is a client-side filesystem that is semi-transparent.
- har:<archivePath>!<fileInArchive> (similar to a jar URI)
example: har:hdfs://host:port/pathinfilesystem/foo.har!path_inside_thearchive
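
A rough sketch of how a client could split such a URI into its two parts. The function
name parse_har_uri is hypothetical, and splitting on the first "!" assumes that the
archive path itself never contains a "!" (the proposal does not say how that case is
handled).

```python
from typing import Tuple

HAR_SCHEME = "har:"


def parse_har_uri(uri: str) -> Tuple[str, str]:
    """Split har:<archivePath>!<fileInArchive> into (archivePath, fileInArchive)."""
    if not uri.startswith(HAR_SCHEME):
        raise ValueError("not a har URI: " + uri)
    rest = uri[len(HAR_SCHEME):]
    # Assumption: the first "!" separates the archive from the path inside it.
    archive_path, sep, file_in_archive = rest.partition("!")
    if not sep or not archive_path or not file_in_archive:
        raise ValueError("expected har:<archivePath>!<fileInArchive>: " + uri)
    return archive_path, file_in_archive
```

For the example URI above this yields the underlying archive location
(hdfs://host:port/pathinfilesystem/foo.har) and the path inside the archive
(path_inside_thearchive).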

- How will map reduce work with this new filesystem?
   No changes to map reduce will be required to use archives as input to map reduce jobs.

- How will the dfs commands work?

   The dfs commands will have to specify the whole URI to operate on the files in an archive.
Archives are immutable, so renames, deletes, and creates will throw an exception in the initial
versions of archives.
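
The immutability rule can be pictured with a small read-only wrapper. This is only an
illustration: the class ReadOnlyArchive and the exception ImmutableArchiveError are
hypothetical names, not part of Hadoop, and the real filesystem would decide which
exception type to throw.

```python
from typing import Dict


class ImmutableArchiveError(Exception):
    """Raised when a caller tries to modify an archive (hypothetical type)."""


class ReadOnlyArchive:
    """Toy view of an archive: reads succeed, every mutation is rejected."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = dict(files)  # private copy so callers cannot mutate it

    def open(self, path: str) -> bytes:
        return self._files[path]

    def rename(self, src: str, dst: str) -> None:
        raise ImmutableArchiveError("rename is not supported on archives")

    def delete(self, path: str) -> None:
        raise ImmutableArchiveError("delete is not supported on archives")

    def create(self, path: str) -> None:
        raise ImmutableArchiveError("create is not supported on archives")
```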

- How will permissions work with archives?
   In the first version of HAR, all files that are archived into a HAR will lose the permissions
they initially had. In later versions of HAR, permissions can be stored in the metadata,
making it possible to unarchive without losing permissions.

- *Future Work*

- Transparent use of archives.
   This will need changes to the Hadoop File System to support mounts that point to an archive,
and changes to DFSClient so that it transparently walks such a mount to the real archive.
 
Comments?





> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

