hadoop-hdfs-issues mailing list archives

From "Owen O'Malley (Resolved) (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HDFS-224) I propose a tool for creating and manipulating a new abstraction, Hadoop Archives.
Date Sat, 28 Jan 2012 03:45:11 GMT

     [ https://issues.apache.org/jira/browse/HDFS-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved HDFS-224.

    Resolution: Duplicate

We have a different version of harchives.
> I propose a tool for creating and manipulating a new abstraction, Hadoop Archives.
> ----------------------------------------------------------------------------------
>                 Key: HDFS-224
>                 URL: https://issues.apache.org/jira/browse/HDFS-224
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Dick King
> -- Introduction
> In some hadoop map/reduce and dfs use cases, including a specific case that arises in
my own work, users would like to populate dfs with a family of hundreds or thousands of directory
trees, each of which consists of thousands of files.  In our case, the trees each have perhaps
20 gigabytes; two or three 3-10-gigabyte files, a thousand small ones, and a large number
of files of intermediate size.  I am writing this JIRA to encourage discussion of a new facility
I want to create and contribute to the dfs core.
> -- The problem
> You can't store such families of trees in dfs in the obvious manner.  The problem is
that the name nodes can't handle the millions or ten million files that result from such a
family, especially if there are a couple of families.  I understand that dfs will not be able
to accommodate tens of millions of files in one instance for quite a while.
> -- Exposed API of my proposed solution
> I would therefore like to produce, and contribute to the dfs core, a new tool that implements
an abstraction called a Hadoop Archive [or harchive].  Conceptually, a harchive is a unit,
but it manages a space that looks like a directory tree.  The tool exposes an interface that
allows a user to do the following:
>  * directory-level operations
>    ** create a harchive [either empty, or initially populated from a locally-stored directory
tree].  The namespace for harchives is the same as the space of possible dfs directory locators,
and a harchive would in fact be implemented as a dfs directory with specialized contents.
>    ** add a directory tree to an existing harchive in a specific place within the harchive
>    ** retrieve a directory tree or subtree at or beneath the root of the harchive directory
structure, into a local directory tree
>  * file-level operations
>    ** add a local file to a specific place in the harchive
>    ** modify a file image in a specific place in the harchive to match a local file
>    ** delete a file image in the harchive.
>    ** move a file image within the harchive
>    ** open a file image in the harchive for reading or writing.
>  * stream operations
>    ** open a harchive file image for reading or writing as a stream, in a manner similar
to dfs files, and read or write it [i.e., hdfsRead(...)].  This would include random access
operators for reading.
>  * management operations
>    ** commit a group of changes [which would be made atomically -- there would be no
way half of a change could be made to a harchive if a client crashes].
>    ** clean up a harchive, if it has become less performant because of extensive editing
>    ** delete a harchive
> We would also implement a command line interface.
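The interface listed above could be modeled roughly as follows. This is only an in-memory sketch to illustrate the staged-change and commit semantics; the class name `Harchive` and all method names are hypothetical and are not part of any Hadoop API or of the original proposal.

```python
class Harchive:
    """In-memory stand-in for a harchive: maps a path to a file image (bytes).

    Hypothetical illustration only. A real implementation would store
    its data in dfs segments rather than in a Python dict.
    """

    def __init__(self):
        self._committed = {}  # state visible to readers
        self._staged = None   # working copy, or None when nothing is pending

    def _working(self):
        # Copy the committed state lazily on the first pending change.
        if self._staged is None:
            self._staged = dict(self._committed)
        return self._staged

    def add(self, path, data):
        """Add or replace a file image at the given path."""
        self._working()[path] = bytes(data)

    def delete(self, path):
        """Delete a file image."""
        del self._working()[path]

    def move(self, src, dst):
        """Move a file image within the harchive."""
        working = self._working()
        working[dst] = working.pop(src)

    def read(self, path):
        """Read a file image from the last committed state."""
        return self._committed[path]

    def commit(self):
        """Publish all pending changes in one step."""
        if self._staged is not None:
            self._committed, self._staged = self._staged, None

    def rollback(self):
        """Discard pending changes, as if the client had crashed."""
        self._staged = None
```

In this sketch, readers only ever see the committed dictionary, so a client that stops before calling `commit()` leaves the harchive unchanged.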
> -- Brief sketch of internals
> A harchive would be represented as a small collection of files, called segments, in a
dfs directory at the harchive's location.  Each segment would contain some of the files of
the harchive's file images in a format to be determined, plus a harchive index.  We may group
files by size, or by some other criterion.  It is likely that harchives would contain only one
segment in common cases.
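Since the segment format is left "to be determined", the following is only one possible layout, sketched under stated assumptions: an 8-byte length header, a JSON index, and then the file images, with smaller files placed closest to the index. The function names and the choice of JSON are assumptions made for illustration. Reading a file image takes one access for the index and one for the data.

```python
import io
import json
import struct


def build_segment(files):
    """Build segment bytes from a dict mapping path -> file image bytes.

    Assumed layout: [8-byte big-endian index length][JSON index][file images].
    Index entries are [offset, length], with offsets relative to the start
    of the file-image area.
    """
    index = {}
    body = bytearray()
    # Smallest files first, so they sit closest to the index.
    for path, data in sorted(files.items(), key=lambda kv: len(kv[1])):
        index[path] = [len(body), len(data)]
        body += data
    index_bytes = json.dumps(index).encode("utf-8")
    return struct.pack(">Q", len(index_bytes)) + index_bytes + bytes(body)


def read_file(segment, path):
    """Read one file image from a seekable, readable segment."""
    # Access 1: the index at the start of the segment.
    segment.seek(0)
    (index_len,) = struct.unpack(">Q", segment.read(8))
    index = json.loads(segment.read(index_len).decode("utf-8"))
    offset, length = index[path]
    # Access 2: the file image itself.
    segment.seek(8 + index_len + offset)
    return segment.read(length)
```

For example, `read_file(io.BytesIO(build_segment({"small.txt": b"hi"})), "small.txt")` returns the stored bytes.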
> Changes would be made by adding the text of the new files, either by rewriting an existing
segment that contains not much more data than the size of the changes or by creating a new
segment, complete with a new index.  When dfs comes to be enhanced to allow appends to dfs
files, as requested by HADOOP-1700, we would be able to take advantage of that.
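The choice between rewriting an existing segment and creating a new one could be expressed as a simple size test. The proposal does not say what "not much more data than the size of the changes" means, so the threshold `max_ratio` below, like the function name, is purely an assumption.

```python
def choose_segment(segment_sizes, change_size, max_ratio=2.0):
    """Pick a segment to rewrite, or return None to create a new segment.

    segment_sizes maps a segment name to its size in bytes. A segment is a
    candidate for rewriting only if it holds at most max_ratio times the
    size of the changes (an assumed threshold, not from the proposal).
    """
    candidates = [
        (size, name)
        for name, size in segment_sizes.items()
        if size <= max_ratio * change_size
    ]
    if not candidates:
        return None
    # Rewrite the smallest candidate, which is the cheapest to copy.
    return min(candidates)[1]
```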
> Often, when a harchive is initially populated, it could be a single segment, and a file
it contains could be accessed with two random accesses into the segment.  The first access
retrieves the index, and the second access retrieves the beginning of the file.  We could
choose to put smaller files closer to the index to allow lower average amortized costs per
> We might instead choose to represent a harchive as one file or a few files for the large
represented files, and smaller files for the represented smaller files.  That lets us make
modifications by copying at lower cost.
> The segment containing the index is found by a naming convention.  Atomicity is obtained
by creating indices and renaming the files containing them according to the convention, when
a change is committed.
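The commit-by-rename idea can be sketched on a local filesystem, used here as a stand-in for dfs. The index file name `_index`, the JSON encoding, and the function names are assumptions; the proposal does not specify the naming convention, and rename semantics in dfs may differ from those of `os.replace`.

```python
import json
import os
import tempfile

INDEX_NAME = "_index"  # hypothetical name under the naming convention


def commit_index(archive_dir, index):
    """Write a new index under a temporary name, then rename it into place.

    Readers see either the old index or the new one, never a partial file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=archive_dir, prefix=".index-tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(index, f)
            f.flush()
            os.fsync(f.fileno())
        # The rename is the commit point.
        os.replace(tmp_path, os.path.join(archive_dir, INDEX_NAME))
    except BaseException:
        # A failed commit leaves the previous index untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_index(archive_dir):
    """Load the currently committed index."""
    with open(os.path.join(archive_dir, INDEX_NAME), encoding="utf-8") as f:
        return json.load(f)
```

If the client crashes before the rename, only a stray temporary file remains, which a clean-up operation could remove later.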
