Subject: Re: Controlling the block placement and the file placement in HDFS writes
From: Ananth Gundabattula <agundabattula@gmail.com>
To: hdfs-dev@hadoop.apache.org
Date: Sat, 20 Dec 2014 08:52:59 +1100

Hello Zhe,

Thanks a lot for the inputs. Storage policies are really what I was
looking for, for one of the problems.

@Nick: I agree that it would be a nice feature to have. Thanks for the
info.

Regards,
Ananth

On Fri, Dec 19, 2014 at 10:49 AM, Nick Dimiduk wrote:

> HBase would enjoy similar functionality. In our case, we'd like all
> replicas for all files in a given HDFS path to land on the same set of
> machines. That way, in the event of a failover, regions can be assigned
> to one of these other machines, which has local access to all blocks for
> all region files.
>
> On Thu, Dec 18, 2014 at 3:36 PM, Zhe Zhang wrote:
>
> > > The second aspect is that our queries are time based and this time
> > > window follows a familiar pattern of old data not being queried much.
> > > Hence we would like to preserve the most recent data in the HDFS
> > > cache (Impala is helping us manage this aspect via its command set),
> > > but we would like the next most recent data chunks to land on an SSD
> > > that is present on every datanode. The remaining set of blocks, which
> > > are "very old but in large quantities", would land on spinning disks.
> > > The decision to choose a given volume is based on the file name, as
> > > we can control the filename that is used to generate the file.
> >
> > Have you tried the 'setStoragePolicy' command? It's part of the HDFS
> > "Heterogeneous Storage Tiers" work and seems to address your scenario.
> >
> > > 1. Is there a way to control that all file blocks belonging to a
> > > particular hdfs directory & file go to the same physical datanode
> > > (and their corresponding replicas as well?)
> >
> > This seems inherently hard: the file/dir could have more data than a
> > single DataNode can host. Implementation-wise, it requires some sort
> > of a map in BlockPlacementPolicy from inode or file path to DataNode
> > address.
> >
> > My 2 cents..
> >
> > --
> > Zhe Zhang
> > Software Engineer, Cloudera
> > https://sites.google.com/site/zhezhangresearch/
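
For completeness, a minimal sketch of the 'setStoragePolicy' approach Zhe
describes, assuming a Hadoop 2.6+ cluster whose DataNode data directories
are tagged with storage types (e.g. "[SSD]/data/ssd" in
dfs.datanode.data.dir). The paths and policy names below are illustrative,
and the set of built-in policies depends on the release.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("Storage policies are HDFS-specific");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Hypothetical layout: recent partitions prefer SSD-tagged volumes;
    // older partitions stay on ordinary spinning-disk volumes (HOT is the
    // default all-DISK policy).
    dfs.setStoragePolicy(new Path("/warehouse/events/recent"), "ALL_SSD");
    dfs.setStoragePolicy(new Path("/warehouse/events/old"), "HOT");

    // The policy applies to blocks written after it is set; existing
    // blocks only move when rewritten (or via the HDFS Mover tool in
    // releases that ship it).
  }
}

Since the original use case keys the tier off the file name, the writing
application could pick the target directory (and hence the policy) from
the same naming convention it already controls.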
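
On Nick's point: HDFS already exposes a client-side "favored nodes" hint
(HDFS-2576, the mechanism HBase's favored-nodes work builds on) that gets
part of the way there. It is only a hint (the NameNode may ignore it, and
re-replication after a failure does not preserve it), but a sketch with
placeholder hostnames looks like this:

import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesExample {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    // Placeholder DataNodes; 50010 is the default DataNode transfer port
    // (dfs.datanode.address). Creating every file of a group with the
    // same hint tends to co-locate their replicas on these machines.
    InetSocketAddress[] favored = {
        new InetSocketAddress("dn1.example.com", 50010),
        new InetSocketAddress("dn2.example.com", 50010),
        new InetSocketAddress("dn3.example.com", 50010)
    };

    FSDataOutputStream out = dfs.create(
        new Path("/data/group1/file1"),
        FsPermission.getFileDefault(),
        true,                  // overwrite
        4096,                  // buffer size
        (short) 3,             // replication
        128L * 1024 * 1024,    // block size
        null,                  // no progress callback
        favored);
    try {
      out.writeBytes("example payload");
    } finally {
      out.close();
    }
  }
}

A guaranteed mapping from an HDFS path to a fixed DataNode set would
still need the custom BlockPlacementPolicy map that Zhe mentions.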