Subject: Re: Directly reading from datanode using Java API got SocketTimeoutException
From: Tenghuan He <tenghuanhe@gmail.com>
To: Chris Nauroth <cnauroth@hortonworks.com>
Cc: user@hadoop.apache.org
Date: Tue, 5 Jan 2016 01:42:10 +0800

Thanks Chris,

Your answer helps me a lot! And I got another idea.
If I launch another thread that uses short-circuit local reads to read the
data stored on the local machine's datanode, that thread takes up no network
bandwidth, so combining local and remote reads may perform better when the
amount of local data is comparable to the remote data.
Does this make sense?

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cnauroth@hortonworks.com> wrote:

> I think you can achieve something close to this with just public APIs by
> launching multiple threads, calling FileSystem#open to get a separate input
> stream in each one, and then calling seek to position each stream at a
> different block boundary. Seek is a cheap operation, basically just
> updating internal offsets. Seeking forward does not require reading
> through the earlier data byte-by-byte, so you won't pay the cost of
> transferring that part of the data.
>
> Whether or not this strategy would really improve performance is subject
> to a lot of other factors. If the application's single-threaded reading
> already saturates the network bandwidth of the NIC, then starting multiple
> threads is unlikely to improve performance. Those threads will just run
> into contention with each other on the scarce network bandwidth resources.
> If instead the application reads data gradually and performs some
> CPU-intensive processing as it reads, then perhaps the NIC is not
> saturated, and multi-threading could help.
>
> As usual with performance work, the actual outcomes are going to be highly
> situational.
>
> I hope this helps.
>
> --Chris Nauroth
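[A minimal sketch of the multi-stream approach described above, using only
public API calls. The file path, thread-pool size, and read buffer size are
illustrative assumptions, not values from this thread.]

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelBlockRead {

    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final Path file = new Path("/data/bigfile");   // illustrative path

        final FileStatus status = fs.getFileStatus(file);
        final long fileLen = status.getLen();
        final long blockSize = status.getBlockSize();
        int numBlocks = (int) ((fileLen + blockSize - 1) / blockSize);

        ExecutorService pool = Executors.newFixedThreadPool(4);  // illustrative size
        List<Future<Long>> results = new ArrayList<Future<Long>>();

        for (int i = 0; i < numBlocks; i++) {
            final long start = i * blockSize;
            final long end = Math.min(start + blockSize, fileLen);
            results.add(pool.submit(new Callable<Long>() {
                @Override
                public Long call() throws Exception {
                    // One stream per thread; seeking forward only updates
                    // internal offsets, it does not transfer the skipped bytes.
                    FSDataInputStream in = fs.open(file);
                    try {
                        in.seek(start);
                        byte[] buf = new byte[64 * 1024];
                        long pos = start;
                        while (pos < end) {
                            int want = (int) Math.min(buf.length, end - pos);
                            int n = in.read(buf, 0, want);
                            if (n < 0) {
                                break;  // unexpected EOF
                            }
                            pos += n;
                            // ... process buf[0..n) here ...
                        }
                        return pos - start;
                    } finally {
                        in.close();
                    }
                }
            }));
        }

        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();  // propagates any per-thread failure
        }
        pool.shutdown();
        System.out.println("Read " + total + " bytes across " + numBlocks + " block ranges");
    }
}

Each thread pays one cheap seek and then reads only its own block-aligned
range, which matches the behavior described above.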
>
> From: Tenghuan He <tenghuanhe@gmail.com>
> Date: Thursday, December 31, 2015 at 5:17 PM
> To: Chris Nauroth <cnauroth@hortonworks.com>
> Cc: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Directly reading from datanode using Java API got SocketTimeoutException
>
> The following is what I want to do.
> When reading a big file that spans multiple blocks, I want to read the
> different blocks from different nodes in parallel, making the read of the
> big file faster.
> Is that possible?
>
> Thanks
>
> On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cnauroth@hortonworks.com> wrote:
>
>> Your code has connected to a DataNode's TCP port, and the DataNode server
>> side is likely blocked expecting the client to send some kind of request
>> defined in the Data Transfer Protocol. The client code here does not write
>> a request, so the DataNode server doesn't know what to do. Instead, the
>> client immediately goes into a blocking read. Since the DataNode server
>> side doesn't know what to do, it's never going to write any bytes back to
>> the socket connection, and therefore the client eventually times out on
>> the read.
>>
>> Stepping back, please be aware that what you are trying to do is
>> unsupported. Relying on private implementation details like this is likely
>> to be brittle and buggy. As the HDFS code evolves in the future, there is
>> no guarantee that what you do here will work the same way in future
>> versions. There might not even be a connectToDN method in future versions
>> if we decide to do some internal refactoring.
>>
>> If you can give a high-level description of what you want to achieve,
>> then perhaps we can suggest a way to do it through the public API.
>>
>> --Chris Nauroth
>>
>> From: Tenghuan He <tenghuanhe@gmail.com>
>> Date: Wednesday, December 30, 2015 at 9:29 AM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Directly reading from datanode using Java API got SocketTimeoutException
>>
>> Hello,
>>
>> I want to directly read from datanode blocks using the Java API as in the
>> following code, but I got a SocketTimeoutException.
>>
>> I use reflection to call the DFSClient private method connectToDN(...)
>> and get an IOStreamPair of in and out, where in is used to read bytes
>> from the datanode.
>> The workhorse code is
>>
>> try {
>>     Method connectToDN;
>>     Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
>>     connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
>>     connectToDN.setAccessible(true);
>>     IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
>>     in = new DataInputStream(pair.in);
>>     System.out.println(in.getClass());
>>     byte[] b = new byte[10000];
>>     in.readFully(b);
>> } catch (Exception e) {
>>     e.printStackTrace();
>> }
>>
>> and the exception is
>>
>> java.net.SocketTimeoutException: 11000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765
>> remote=/192.168.179.135:50010]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:195)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:169)
>>     at BlocksList.main(BlocksList.java:69)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>     at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
>>
>> Could anyone tell me where the problem is?
>>
>> Thanks & Regards
>>
>> Tenghuan He
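[For comparison, a sketch of the same 10000-byte read done through the
public API. FSDataInputStream supports positioned reads, and the client
library performs the Data Transfer Protocol handshake internally, so no
reflection into DFSClient is needed. The path and read offset are
illustrative assumptions.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/bigfile");  // illustrative path

        FSDataInputStream in = fs.open(file);
        try {
            byte[] b = new byte[10000];
            // Positioned read: fills b starting at the given file offset
            // without moving the stream's current position. Block lookup
            // and the DataNode request happen inside the client library.
            in.readFully(0L, b);  // offset 0 is illustrative
        } finally {
            in.close();
        }
    }
}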