hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "HdfsFutures" by SanjayRadia
Date Fri, 28 Mar 2008 00:32:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SanjayRadia:
http://wiki.apache.org/hadoop/HdfsFutures

The comment on the change is:
  

------------------------------------------------------------------------------
  
  '''The following page is under development - it is being converted from TWiki to Wiki'''
  
- == Goal: Make HDFS Real ==
+ == Goal: HDFS for Production Use ==
  
   1. Reliable and Secure: The file system is solid enough for users to feel comfortable using it in "production"
         * Availability and integrity of HDFS are good enough
@@ -15, +15 @@

            * Availability of the file data and its integrity
           * Secure
              * Access control - in 0.16
-             * Secure authentication  0.18?)
+             * Secure authentication  0.19
   1. Good Enough Performance: HDFS should not limit the scaling of the Grid and the utilization of the nodes in the Grid
        * Handle large number of files
        * Handle large number of clients
-       * Low latency of HDFS operation - this will affect the utilization of the client nodes<br />
+       * Low latency of HDFS operation - this will affect the utilization of the client nodes
        * High throughput of HDFS operations
-    1. Rich Enough FS Features for applications<br />
+    1. Rich Enough FS Features for applications
        * e.g. append
        * e.g. good performance for random IO
   1. Sufficient Operations and Management features to manage a large 4K-node cluster
        * Easy to configure, upgrade etc
        * BCP, snapshots, backups
-       *
+      
  
  == Service Scaling ==
  
@@ -37, +37 @@

   * scale the performance of the name service - i.e. its throughput and latency and in particular the number of concurrent clients
 Improving one may improve the other.
     * E.g. moving Block map functionality to slave NNs will free up storage for Name entries
-    * E.g. Paritioning name space can also improve performance of each NN slave<br />
+    * E.g. Partitioning the name space can also improve the performance of each NN slave
  
  === Summary of various options that scale name space and performance (details below) ===
-  (Also see [http://twiki.corp.yahoo.com/pub/Grid/HdfsFeaturePlanning/ScaleNN_Sea_of_Options.pdf/ Scaling NN: Sea of Options])
+  (Also see attachment:ScaleNN_Sea_of_Options.pdf)
     * Grow memory
        * Scales name space but not performance
        * Issue: GC and Java scaling for large memories
@@ -59, +59 @@

  
     * Distribute/Partition/Replicate the NN functionality across multiple computers
        * Read-only replicas of the name node
-          * What is the ratio of Rs to Ws - get data from Simon<br />
+          * What is the ratio of Rs to Ws - get data from Simon
           * Note: RO replicas can be useful for the HA solution and for checkpoint rolling
-          * Add pointer to the RO replicas design we did on the white board. TBD<br />
  
      * Partition by function (also scales namespace and addressable storage space)
           * E.g. move block management and processing to slave NN.
@@ -71, +70 @@

         * this helps scale the performance of the NN and also the name space
  
     * RPC and Timeout issues
-       * When load spikes occur, the clients timeout and the spiral of death occurs<br />
+       * When load spikes occur, the clients timeout and the spiral of death occurs
        * See [#Hadoop_Protocol_RPC/ Hadoop Protocol RPC]
  
     * Higher concurrency in Namespace access (more sophisticated Namespace locking)
        * This is probably an issue only on NN restart, not during normal operation
        * Improving concurrency is hard since it will require redesign and testing
-          * Better to do this when NN is being redesigned for other reasons.<br />
+          * Better to do this when NN is being redesigned for other reasons.
  
   * Journaling and Sync
-       * *Benefits*: improves latency, client utilization, less timeouts, greater throughput<br />
+       * *Benefits*: improves latency, client utilization, fewer timeouts, greater throughput
        * Improve Remote syncs
         * Approach 1 - NVRAM NFS file system - investigate this
-          * Approach 2 - If flush on NFS pushes the data to the NFS server, this may be good eough if there is a local sync - investigate<br />
+          * Approach 2 - If a flush on NFS pushes the data to the NFS server, this may be good enough if there is a local sync - investigate
-       * Lazy syncs - need to investigate the benefit and cost (latency)<br />
+       * Lazy syncs - need to investigate the benefit and cost (latency); see the sketch after this list
         * Delay the reply by a few milliseconds to allow for more bunching of syncs
-          * This increases the latency<br />
+          * This increases the latency
      * NVRAM for journal
-       * Async sysncs [No!!!]<br />
+       * Async syncs [No!!!]
           * reply as soon as memory is updated
           * This changes semantics
-             * - note even though UNIX like this, the failure of machine implies failure of client and fs *together*
+             * If it is good enough for Unix then isn't it good enough for HDFS?
+                * For a single machine, its failure implies failure of client and fs *together*
-          * In a distributed file system, there is partial failure; further more one expects HA'ed NN to not loose data.<b><br /></b>
+                * In a distributed file system, there is partial failure; furthermore, one expects an HA'ed NN not to lose data
-       * Issue: if async syncs then change in semantics - this an issue What about HA?
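
 A rough sketch of the bunched-sync idea above (class and method names are hypothetical, not the actual NN journal code): writers append under one lock, and a sync request that arrives while another sync is in flight waits for it and is usually covered by it, so N concurrent writers can cost one disk sync instead of N.

 {{{
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch of bunched journal syncs.
class BatchedJournal {
  private final FileOutputStream out;
  private final Object syncLock = new Object(); // serializes disk syncs
  private long lastWrittenTxid = 0;
  private volatile long lastSyncedTxid = 0;

  BatchedJournal(FileOutputStream out) { this.out = out; }

  // Append a journal record; cheap, no disk sync here.
  synchronized long write(byte[] record) throws IOException {
    out.write(record);
    return ++lastWrittenTxid;
  }

  // Block until txid is durable. A sync that ran while we waited on
  // syncLock may already have covered us, in which case we return
  // without touching the disk at all.
  void syncUpTo(long txid) throws IOException {
    if (txid <= lastSyncedTxid) return;
    synchronized (syncLock) {
      if (txid <= lastSyncedTxid) return;
      long target;
      synchronized (this) { target = lastWrittenTxid; }
      out.getFD().sync(); // one disk sync covers every waiting writer
      lastSyncedTxid = target;
    }
  }
}
 }}}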
  
     * Move more functionality to data node
           * Distributed replica creation  - not simple
  
   *  Improve Block report processing [https://issues.apache.org/jira/browse/HADOOP-2448/ HADOOP-2448]
-            2K nodes mean a block report every 3 sec.<br />
+            2K nodes mean a block report every 3 sec.
      * Currently: Each DN sends a full BR as an array of longs every hour. The initial BR has a random backoff (configurable)
      * Incremental and Event based B-reports - [https://issues.apache.org/jira/browse/HADOOP-1079/ HADOOP-1079]; see the sketch after this list
         * E.g. when a disk is lost, or blocks are deleted, etc.
         * DN can determine what, if anything, has changed and send only if there are changes
        * Send only checksums
-          * NN recalculates the checksum, OR has rolling checksum<br />
+          * NN recalculates the checksum, OR has rolling checksum
      * Make the initial block report's random backoff be dynamically set via the NN when DNs register - [https://issues.apache.org/jira/browse/HADOOP-2444/ HADOOP-2444]
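
 A hypothetical sketch of the incremental report idea above (the encoding is invented, this is not the HADOOP-1079 patch): the DN queues block add/delete events and ships only the delta on the next heartbeat, so an idle DN sends nothing at all.

 {{{
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of incremental block reports (cf. HADOOP-1079).
// A periodic full report would remain as a safety net.
class IncrementalReporter {
  // Toy encoding in the spirit of today's array-of-longs full report:
  // positive = block added, negative = block deleted. (A real encoding
  // needs an explicit op field; this one breaks for block id 0.)
  private final List<Long> pending = new ArrayList<Long>();

  synchronized void blockAdded(long blockId)   { pending.add(blockId); }
  synchronized void blockDeleted(long blockId) { pending.add(-blockId); }

  // Called on the heartbeat. Returns null when nothing has changed,
  // so an idle DN costs the NN no block-report processing.
  synchronized long[] drainDelta() {
    if (pending.isEmpty()) return null;
    long[] delta = new long[pending.size()];
    for (int i = 0; i < delta.length; i++) delta[i] = pending.get(i);
    pending.clear();
    return delta;
  }
}
 }}}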
  
  
-    * <br />
+ 
  
  === Scaling Namespace (i.e. number of files/dirs) ===
  
-    *    Partition/distribute Name node (will also help performance)
+ Partition/distribute Name node (will also help performance)
+    [[BR]]Several Options:
     * Partition the namespace hierarchically and mount the volumes
        * In this scheme, there are multiple namespace volumes in a cluster.
        * All the name space volumes share the physical block storage (i.e. One storage pool)
-       * Optionally All namesspaces (ie volumes) are mounted at top level using an automounter like approach
+       * Optionally, all namespaces (i.e. volumes) are mounted at the top level using an automounter-like approach
      * A namespace can be explicitly mounted onto a node in another namespace (a la mount in Posix)
         * Note the Ceph file system [ref] partitions automatically and mounts the partition
  
-    * Partition by a hash function - clients go to one NN by hashing the name<br />
+    * Partition by a hash function - clients go to one NN by hashing the name (see the sketch below)
  
     *    Only keep part of the namespace in memory.
      * This is like a traditional file system where the entire namespace is stored in secondary storage and paged in as needed.
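
 A minimal sketch of the hash-partition option above (all names hypothetical): every client maps a path to one of N name nodes with the same deterministic hash, so no lookup service is needed. Renames across partitions, directory listings, and adding or removing NNs are the hard parts this ignores.

 {{{
import java.net.InetSocketAddress;

// Hypothetical sketch of namespace partitioning by hash. Hashing the
// full path spreads even one directory's children over several NNs,
// which is why a real design might hash the parent directory instead.
class HashedNamespace {
  private final InetSocketAddress[] nameNodes;

  HashedNamespace(InetSocketAddress[] nameNodes) {
    this.nameNodes = nameNodes;
  }

  // Deterministic: every client picks the same NN for a given path.
  InetSocketAddress nameNodeFor(String path) {
    int h = path.hashCode() & Integer.MAX_VALUE; // force non-negative
    return nameNodes[h % nameNodes.length];
  }
}
 }}}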
@@ -137, +137 @@

   * Keep 2 generations of fsimage - the checkpoint daemon verifies the fsimage each time it creates the new one.
   *    CRCs for fsimage and journal
     * Make the NN persistent data solid
-       * add internal consistency counters - to detect bugs in our code<br />
+       * add internal consistency counters - to detect bugs in our code
-          * Num files, Num dirs, num blocks, sentenials between fields, strong lengths<br />
+          * Num files, num dirs, num blocks, sentinels between fields, string lengths
   * Recycling of block-Ids - problems if old data nodes come back - fix has been designed
     * If failure in FSImage, recover from alternate fsimages
     *    Versioning of NN persistent data (use jute)
     * Smart fsck
        * Bad entry in journal - ignore rest
      * Bad entry in journal - ignore only the remaining entries not affected (Hard)
-       * If multiple journals, recover from the best one or merge the entries<br />
+       * If multiple journals, recover from the best one or merge the entries
-       * NN has flag about whether to continue on such an error<br />
+       * NN has flag about whether to continue on such an error
     * Recreating NN data from DN will require fundamental changes in design
  
  === Faster Startup ===
@@ -156, +156 @@

     * Reload FS Image faster
  
  === Restart and Failover ===
-    *    Automatic NN restart on NN failure (We will lets the operations folks handle this)<br />
+    *    Automatic NN restart on NN failure (operations can add a standard watchdog for this)
   *    Hot standby & failover
  
  == Security: Authorization and ACLs ==
  
  
-    *    0.16 has access control without authorization (i.e. trust client)
+    *    0.16 has access control with very weak authentication
-    *    Security Use cases and requirements (2008 - Q1)<br />
+       * Client side grabs OS user id and passes it to NN
-    *    Secure authorization (2008 - Q2, Q3)
+    *    Secure authorization 0.19
   *    Service-level access control - ie which user can access the HDFS service (as opposed to ACLs for specific files)
  
  == File Features ==
     * File data visible as flushed
        *   Motivation: logging and tail -f
-    * Open files are NOT accessible by readers in the event of deletion or renaming
+    * Currently if an open file is renamed or deleted, the client with the open file can get an exception
+       * We are unlikely to fix this as keeping a list of open files on the NN side is probably too expensive.
     * Growable Files
        *    via atomic append with multiple writers
        * Via append with 1 writer [http://issues.apache.org/jira/browse/HADOOP-1700/ Hadoop-1700]
@@ -178, +179 @@

        * Use case for this?
      * note truncate and append need to be designed together
     * Concatenate files
-       * May reduce name space growth if apps merge small files into a larger one
-       * To make this work well, we will need to support variable length blocks
-       * Is there a way for the map-reduce framework to take advantage of this feature automatically?
+       * Here multiple files are concatenated by merging their block lists (ie no data is copied); see the sketch after this list
+       * This will require support for variable length blocks.
+       * Reduces the number of names, but since the # of blocks stays the same it does not offer much name space scaling
     * Support multiple writers for log file
        * Alternatives
          1 logging toolkit that adapts to Hadoop
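
 A hypothetical NN-side sketch of the concatenate-by-block-list-merge idea above (the inode is simplified to just a block list): only the namespace is touched and no block data is copied, which is also why the block count, and hence the NN memory spent on blocks, is unchanged.

 {{{
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: concat merges block lists; no data moves.
// Since every source block survives at its current length, this needs
// variable-length block support (each source's last block is usually
// shorter than the configured block size).
class INodeFile {
  final List<Long> blockIds = new ArrayList<Long>();
}

class ConcatSketch {
  // Splice the sources' blocks onto the target, in order, then empty
  // the sources so their names can be removed from the namespace.
  static void concat(INodeFile target, List<INodeFile> sources) {
    for (INodeFile src : sources) {
      target.blockIds.addAll(src.blockIds);
      src.blockIds.clear();
    }
  }
}
 }}}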
@@ -199, +200 @@

  
  
  == Namespace Features ==
-    * Hardlinks
+    * Hardlinks (not really needed)
      * Will need to add file-ids to make this work
     *  Symbolic links
     *    Native mounts - mount Hadoop on Linux
@@ -211, +212 @@

   * Notion of a fileset - kind of like a directory, except that it cannot contain directories - under discussion - has potential to scale name node.
  
  
- == File Data Integrity (For NN see NN Availability) ==
+ == File Data Integrity (For NN see NN data integrity above) ==
  
  
     * Periodic data verification
@@ -230, +231 @@

     * NN data rollback
        * this depends on keeping old fsImage/journals around
        * A startup parameter for this?
-       * Need an advisory on how much data will be discarded so that operator can make an intelligent decision<br />
+       * Need an advisory on how much data will be discarded so that the operator can make an intelligent decision
  
     * Snapshots
        * We allow snapshots only when system is offline
@@ -247, +248 @@

  [[Anchor(Hadoop_Protocol_RPC)]]
  
  === RPC Timeouts, Connection handling, Queue handling, threading ===
-    * When load spikes occur, the clients timeout and the spiral of death occurs<br />
+    * When load spikes occur, the clients timeout and the spiral of death occurs
   * Remove Timeout, Instead Ping to detect server failures [http://issues.apache.org/jira/browse/HADOOP-2188/ HADOOP-2188] (see the sketch below)
     * Improve Connection handling, idle connections etc
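
 A minimal sketch of the ping-instead-of-timeout idea (cf. HADOOP-2188; the ping opcode and structure are invented, this is not the actual IPC client): a slow reply no longer fails the call, only an unreachable server does, which avoids the retry storm that turns a load spike into the spiral of death.

 {{{
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch: treat a read timeout as a cue to ping the
// server rather than as a failure of the call itself.
class PingingReader {
  static int readByte(Socket s, int pingIntervalMs) throws IOException {
    s.setSoTimeout(pingIntervalMs);
    InputStream in = s.getInputStream();
    while (true) {
      try {
        return in.read(); // normal case: the reply arrived
      } catch (SocketTimeoutException e) {
        sendPing(s); // server slow? verify it is still alive; an
                     // IOException here means the server is gone
      }
    }
  }

  private static void sendPing(Socket s) throws IOException {
    s.getOutputStream().write(0xFF); // hypothetical ping opcode
    s.getOutputStream().flush();
  }
}
 }}}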
  
@@ -273, +274 @@

     *    For HDFS and Map-Reduce
  
  == Diagnosability ==
-    *     NN - what do we need here - log analysis?<br />
+    *     NN - what do we need here - log analysis?
     *    DN - what do we need here?
  
  == Development Support ==
@@ -288, +289 @@

  
     * Support for keeping data in sync across data-centers
  
+ == Attachments ==
+ [[AttachList]]
+ 
