hadoop-common-dev mailing list archives

From "Mahadev konar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3307) Archives in Hadoop.
Date Sun, 25 May 2008 03:07:56 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Attachment: hadoop-3307_1.patch

This patch addresses the archives issue.

This patch includes the following -- 

- har:///user/mahadev/foo.har 

denotes a Hadoop archive. This is the default URI form, which uses the default underlying filesystem
specified in your conf.

If you want to be explicit, or to use some other HDFS (not the default one),

then the URI is --

har://hdfs-host:port/user/mahadev/foo.har

The URIs carry an implicit assumption about which part of the URI denotes the directory for
Hadoop archives. The code scans the path from the end and assumes the part matching *.har
to be the directory that is the archive.
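The "scan from the end" rule can be illustrated with a small sketch. This is not the code from the patch; the function name split_har_uri and the returned tuple layout are hypothetical, chosen only to show how the last *.har component would separate the archive directory from the path inside it.

```python
from urllib.parse import urlparse


def split_har_uri(uri):
    """Return (underlying authority, archive directory, path inside the archive).

    Hypothetical helper: it mirrors the rule described above, not the patch itself.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "har":
        raise ValueError("not a har URI: " + uri)
    parts = [p for p in parsed.path.split("/") if p]
    # Walk from the last path component toward the root; the first
    # component ending in ".har" is taken to be the archive directory.
    for i in range(len(parts) - 1, -1, -1):
        if parts[i].endswith(".har"):
            archive = "/" + "/".join(parts[: i + 1])
            inner = "/".join(parts[i + 1:])
            return parsed.netloc, archive, inner
    raise ValueError("no *.har component in: " + uri)
```

An empty authority corresponds to the har:/// form, where the default filesystem from the conf would be used.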


- It has a filesystem layer, so all the commands like

hadoop fs -ls har:///user/mahadev/foo.har 

work. Most of the mutating commands are not implemented for archives. -cat and -copyToLocal
work as expected.

- It works with map reduce.

So the input to a map reduce job could be har:///user/mahadev/foo.har, and this would work
fine.

Code Design and explanation - 

- There are two index files. The _index file contains entries of the form
  filename <dir>/<file> partfile startindex size childpathnames_if_directory.
  The _index file is sorted by the hashcode of the filenames.
  The second index file, _masterindex, contains pointers into the _index file to speed up the
  lookup time of files inside the _index file.
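A rough sketch of how such a two-level lookup could work follows. It is an illustration under assumptions, not the on-disk format of the patch: the in-memory tuples, the chunk size, the function names and the use of zlib.crc32 as a stand-in for the filename hashcode are all hypothetical.

```python
import zlib


def stable_hash(path):
    # Stand-in for the filename hashcode; the real hash function is not shown here.
    return zlib.crc32(path.encode("utf-8"))


def build_indexes(entries, chunk=2):
    """entries maps path -> (partfile, startindex, size).

    Returns (index, master): index is sorted by hash, and master holds one
    (first_hash, last_hash, begin, end) pointer per chunk of the index.
    """
    index = sorted((stable_hash(p), p) + tuple(loc) for p, loc in entries.items())
    master = []
    for begin in range(0, len(index), chunk):
        block = index[begin:begin + chunk]
        master.append((block[0][0], block[-1][0], begin, begin + len(block)))
    return index, master


def lookup(index, master, path):
    """Use the master index to pick candidate chunks, then scan only those."""
    h = stable_hash(path)
    for first_hash, last_hash, begin, end in master:
        if first_hash <= h <= last_hash:
            for _, entry_path, partfile, start, size in index[begin:end]:
                if entry_path == path:
                    return partfile, start, size
    return None
```

The point of the second level is that a reader only has to scan the small slice of _index whose hash range covers the requested filename, rather than the whole file.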

- To create an archive, the user needs to run
  bin/hadoop archives -archiveName foo.har inputpaths outputdir
 
  This is a map reduce job wherein all the files are distributed amongst the maps, which create
  part files of around 2GB or so. The reduce then gets the startindex and size from the maps
  for all the files and creates the _index and _masterindex.
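The bookkeeping behind the part files can be sketched in a single process. This is only an illustration of how a startindex and size per file might be derived while packing; the function name pack_files, the part-N naming and the rollover rule are assumptions, and the real job spreads this work across maps.

```python
def pack_files(files, part_limit=2 * 1024 ** 3):
    """Assign each (name, size) pair to a part file and a byte offset.

    Returns (name, partfile, startindex, size) records, the kind of
    information the reduce would need to write the index files.
    """
    records = []
    part, offset = 0, 0
    for name, size in files:
        # Start a new part file when this file would push the current one
        # past the limit (a file larger than the limit still gets its own part).
        if offset > 0 and offset + size > part_limit:
            part += 1
            offset = 0
        records.append((name, "part-%d" % part, offset, size))
        offset += size
    return records
```

With a small limit for demonstration, pack_files([("a", 6), ("b", 5), ("c", 3)], part_limit=10) places "a" alone in the first part and "b" and "c" back to back in the second.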

- Permissions are not persisted, so the permissions returned by the Har filesystem are the
same as those of the index files.



> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

