From: John Lilley
To: user@hadoop.apache.org
Subject: RE: HDFS short-circuit reads
Date: Thu, 19 Dec 2013 14:34:41 +0000

Ah, I see - thanks for clarifying.

john

From: Chris Nauroth [mailto:cnauroth@hortonworks.com]
Sent: Tuesday, December 17, 2013 4:32 PM
To: user@hadoop.apache.org
Subject: Re: HDFS short-circuit reads

Both of these methods return the same underlying data type that you're ultimately interested in. This is the BlockLocation object, which contains the hosts that have a replica of the block. Depending on your usage pattern, one of these methods might be more convenient than the other.

If your application's input is a single file, then you'll likely find that getFileBlockLocations is a good fit. This will give you the BlockLocation information for that one file, and you won't need to write extra code to pull it out of the RemoteIterator (which you know is only going to contain one result anyway).

If your application's input is a whole directory, and you then process all files within that directory, then you'll likely find listLocatedStatus to be more convenient. You'll be able to make a single RPC call to get all of the BlockLocation information for all files. (Like you said, one call instead of many.)
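A minimal sketch of the two approaches (illustrative only, not from the original thread; the paths and the default Configuration setup are hypothetical):

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class BlockLocationSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site/hdfs-site on the classpath
        FileSystem fs = FileSystem.get(conf);

        // Single file: getFileBlockLocations returns the BlockLocations for just that file.
        Path file = new Path("/data/input/part-00000");    // hypothetical path
        FileStatus stat = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println(file + " @ " + block.getOffset()
              + " -> " + Arrays.toString(block.getHosts()));
        }

        // Whole directory: listLocatedStatus returns statuses that already carry
        // their BlockLocations, so there is no per-file follow-up call.
        Path dir = new Path("/data/input");                 // hypothetical path
        RemoteIterator<LocatedFileStatus> it = fs.listLocatedStatus(dir);
        while (it.hasNext()) {
          LocatedFileStatus status = it.next();
          for (BlockLocation block : status.getBlockLocations()) {
            System.out.println(status.getPath() + " @ " + block.getOffset()
                + " -> " + Arrays.toString(block.getHosts()));
          }
        }
      }
    }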
Chris Nauroth
Hortonworks
http://hortonworks.com/

On Tue, Dec 17, 2013 at 6:39 AM, John Lilley <john.lilley@redpoint.net> wrote:

Thanks! I do call FileSystem.getFileBlockLocations() now to map tasks to local data blocks; is there any advantage to using listLocatedStatus() instead? I guess one call instead of two...

John

From: Chris Nauroth [mailto:cnauroth@hortonworks.com]
Sent: Monday, December 16, 2013 6:07 PM
To: user@hadoop.apache.org
Subject: Re: HDFS short-circuit reads

Hello John,

Short-circuit reads are not on by default. The documentation page you linked to at hadoop.apache.org contains all of the information you need to enable them, though.

Regarding checking the status of short-circuit reads programmatically, here are a few thoughts:

Your application could check the Configuration for the dfs.client.read.shortcircuit key. This will tell you at a high level whether the feature is enabled. However, note that the feature needs to be turned on in the configuration of both the DataNode and the HDFS client process. Depending on the details of the deployment, the DataNode and the client might be using different configuration files.
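A minimal sketch of that client-side check (illustrative only; it assumes the hdfs-site.xml on the client's classpath is the one the client actually uses, and it says nothing about the DataNode's configuration):

    import org.apache.hadoop.conf.Configuration;

    public class ShortCircuitCheckSketch {
      public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml found on the client's classpath.
        Configuration conf = new Configuration();
        boolean clientEnabled = conf.getBoolean("dfs.client.read.shortcircuit", false);
        System.out.println("Short-circuit read enabled on the client side: " + clientEnabled);
      }
    }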
This tells you if the feature is enabled, but it doesn't necessarily tell you whether you're really going to get short-circuit reads when you open the file. There might not be a local replica for the block, in which case the read would fall back to the typical remote read behavior anyway.

Depending on what your application wants to achieve, you might also be interested in looking at the FileSystem.listLocatedStatus API to query information about blocks and the corresponding locations of replicas. Applications like MapReduce use this information to try to schedule their work for optimal locality. Short-circuit reads then become a further optimization on top of the gains already achieved by locality.

Hope this helps,

Chris Nauroth
Hortonworks
http://hortonworks.com/

On Mon, Dec 16, 2013 at 4:21 PM, John Lilley <john.lilley@redpoint.net> wrote:

Our YARN application would benefit from maximal bandwidth on HDFS reads, but I'm unclear on how short-circuit reads are enabled.

Are they on by default?

Can our application check programmatically to see whether short-circuit reads are enabled?

Thanks,
john

RE:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
https://issues.apache.org/jira/browse/HDFS-347
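For reference, a sketch of the hdfs-site.xml entries the linked page describes for enabling short-circuit reads on Hadoop 2.x (illustrative only; the exact properties and socket path depend on the Hadoop version and deployment, the settings must be visible to both the DataNode and the client, and the native libhadoop library is required):

    <!-- Illustrative hdfs-site.xml snippet; socket path is an example. -->
    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>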
