Date: Fri, 12 Jul 2013 23:21:51 +0000 (UTC)
From: "Suresh Srinivas (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-4672) Support tiered storage policies

    [ https://issues.apache.org/jira/browse/HDFS-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707530#comment-13707530 ]

Suresh Srinivas commented on HDFS-4672:
---------------------------------------

bq. Today the scope of HDFS-2832 was widened to duplicate this issue. Since the issues are linked, that was not necessary.

I disagree.
Here is the brief comment I had posted on that jira - https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12539644&commentId=13192326
{quote}
# Support for heterogeneous storages:
#* DN could support along with disks, other types of storage such as flash etc.
#* Suitable storage can be chosen based on client preference such as need for random reads etc.
# Block report scaling: instead of a single monolithic block report, a smaller block report per storage becomes possible. This is important with the growth in disk capacity and number of disks per datanode.
# Better granularity of storage failure handling:
#* DN could just indicate loss of storage and namenode can handle it better since it knows the list of blocks belonging to a storage.
#* DN could locally handle storage failures or provide decommissioning of a storage by marking a storage as ReadOnly.
# Hot pluggability of disks/storages: adding and deleting a storage to a node is simplified.
# Other flexibility: includes future enhancements to balance storages within a datanode, balancing the load (number of transceivers) per storage etc and better block placement strategies.
{quote}

It briefly mentions the following points, which are duplicated in this jira:
# Client preference for writing to storages - automatically means that block placement must consider storage type etc.
# Support for different storage types in datanode and block reports based on that.
# Awareness of those storage types at the namenode (not just for block placement, but for various other benefits)
# Affinity of replicas to a storage type.

Certainly you have elaborated on these points and added more implementation details. That does not make it a different jira.

> Support tiered storage policies
> -------------------------------
>
>                 Key: HDFS-4672
>                 URL: https://issues.apache.org/jira/browse/HDFS-4672
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, hdfs-client, libhdfs, namenode
>            Reporter: Andrew Purtell
>
> We would like to be able to create certain files on certain storage device classes (e.g. spinning media, solid state devices, RAM disk, non-volatile memory). HDFS-2832 enables heterogeneous storage at the DataNode, so the NameNode can gain awareness of what different storage options are available in the pool and where they are located, but no API is provided for clients or block placement plugins to perform device aware block placement. We would like to propose a set of extensions that also have broad applicability to use cases where storage device affinity is important:
>
> - Add an enum of generic storage device classes, borrowing from current taxonomy of the storage industry
>
> - Augment DataNode volume metadata in storage reports with this enum
>
> - Extend the namespace so pluggable block policies can be specified on a directory and storage device class can be tracked in the Inode. Perhaps this could be a larger discussion on adding support for extended attributes in the HDFS namespace. The Inode should track both the storage device class hint and the current actual storage device class. FileStatus should expose this information (or xattrs in general) to clients.
>
> - Extend the pluggable block policy framework so policies can also consider, and specify, affinity for a particular storage device class
>
> - Extend the file creation API to accept a storage device class affinity hint. Such a hint can be supplied directly as a parameter, or, if we are considering extended attribute support, then instead as one of a set of xattrs.
>   The hint would be stored in the namespace and also used by the client to indicate to the NameNode/block placement policy/DataNode constraints on block placement. Furthermore, if xattrs or device storage class affinity hints are associated with directories, then the NameNode should provide the storage device affinity hint to the client in the create API response, so the client can provide the appropriate hint to DataNodes when writing new blocks.
>
> - The list of candidate DataNodes for new blocks supplied by the NameNode to clients should be weighted/sorted by availability of the desired storage device class.
>
> - Block replication should consider storage device affinity hints. If a client move()s a file from a location under a path with affinity hint X to under a path with affinity hint Y, then all blocks currently residing on media X should be eventually replicated onto media Y with the then excess replicas on media X deleted.
>
> - Introduce the concept of degraded path: a path can be degraded if a block placement policy is forced to abandon a constraint in order to persist the block, when there may not be available space on the desired device class, or to maintain the minimum necessary replication factor. This concept is distinct from the corrupt path, where one or more blocks are missing. Paths in degraded state should be periodically reevaluated for re-replication.
>
> - The FSShell should be extended with commands for changing the storage device class hint for a directory or file.
>
> - Clients like DistCP which compare metadata should be extended to be aware of the storage device class hint. For DistCP specifically, there should be an option to ignore the storage device class hints, enabled by default.
>
> Suggested semantics:
>
> - The default storage device class should be the null class, or simply the “default class”, for all cases where a hint is not available.
>   This should be configurable. hdfs-defaults.xml could provide the default as spinning media.
>
> - A storage device class hint should be provided (and is necessary) only when the default is not sufficient.
>
> - For backwards compatibility, any FSImage or edit log entry lacking a storage device class hint is interpreted as having affinity for the null class.
>
> - All blocks for a given file share the same storage device class. If the replication factor for this file is increased the replicas should all be placed on the same storage device class.
>
> - If one or more blocks for a given file cannot be placed on the required device class, then the file is marked as degraded. Files in degraded state should be periodically reevaluated for re-replication.
>
> - A directory and path can only have one storage device affinity hint. If the file inode specifies a hint, this is used, otherwise we walk up the path until a hint is found and use that one, otherwise the default storage class is used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
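
For readers of the archive, the last suggested semantic in the quoted description (use the inode's own hint, otherwise walk up the path until a hint is found, otherwise use the default class) might be sketched roughly as follows. This is only an illustration under stated assumptions: StorageHintResolver, StorageClass, setHint and resolve are hypothetical names and not HDFS APIs, the enum values merely mirror the device classes named in the description, and paths are assumed to be absolute and normalized.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of hint resolution; none of these names exist in HDFS. */
final class StorageHintResolver {

    /** Illustrative device classes, taken from the examples in the issue description. */
    enum StorageClass { DEFAULT, SPINNING, SSD, RAM_DISK, NVM }

    // Hints keyed by absolute path; stands in for a per-inode attribute (assumption).
    private final Map<String, StorageClass> hints = new HashMap<>();
    private final StorageClass defaultClass;

    StorageHintResolver(StorageClass defaultClass) {
        this.defaultClass = defaultClass;
    }

    /** Record a hint on a file or directory. */
    void setHint(String path, StorageClass hint) {
        hints.put(path, hint);
    }

    /** Use the path's own hint if present, else the nearest ancestor's, else the default. */
    StorageClass resolve(String path) {
        String current = path;
        while (current != null) {
            StorageClass hint = hints.get(current);
            if (hint != null) {
                return hint;
            }
            current = parentOf(current);
        }
        return defaultClass;
    }

    /** Parent of an absolute, normalized path; null once the root has been visited. */
    private static String parentOf(String path) {
        if (path.equals("/")) {
            return null;
        }
        int slash = path.lastIndexOf('/');
        return slash <= 0 ? "/" : path.substring(0, slash);
    }

    public static void main(String[] args) {
        StorageHintResolver resolver = new StorageHintResolver(StorageClass.SPINNING);
        resolver.setHint("/data/hot", StorageClass.SSD);
        // Inherits the hint set on the ancestor directory /data/hot.
        System.out.println(resolver.resolve("/data/hot/table/part-0"));
        // No hint anywhere on this path, so the configured default applies.
        System.out.println(resolver.resolve("/data/cold/file"));
    }
}
```

A flat map keeps the sketch short; in a real namespace the hint would presumably live on the inode (or in an xattr, as the description suggests) and the walk would follow parent pointers rather than parse path strings.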