hadoop-common-user mailing list archives

From Tenghuan He <tenghua...@gmail.com>
Subject Re: Directly reading from datanode using JAVA API got socketTimeoutException
Date Mon, 04 Jan 2016 17:42:10 GMT
Thanks Chris
Your answer helps me a lot!
I also have another idea: if I launch an additional thread that uses
short-circuit local reads to read the data stored on the local machine's
datanode, which consumes no network bandwidth, the combined local-and-remote
reading may perform better when the amount of local data is comparable to
the remote data.
Does this make sense?
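For context, a minimal sketch of the client side of that idea (the domain
socket path and file path here are assumptions, and short-circuit reads also
require matching datanode configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

// Enable short-circuit local reads on the client; the socket path must
// match the datanodes' dfs.domain.socket.path (assumed value here).
Configuration conf = new Configuration();
conf.setBoolean("dfs.client.read.shortcircuit", true);
conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
FileSystem fs = FileSystem.get(conf);

// Block locations show which blocks have a replica on the local host, so
// local and remote reader threads can split the offset ranges between them.
Path path = new Path("/data/bigfile");  // assumed example path
FileStatus status = fs.getFileStatus(path);
for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
    System.out.println(loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
}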

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cnauroth@hortonworks.com>
wrote:

> I think you can achieve something close to this with just public APIs by
> launching multiple threads, calling FileSystem#open to get a separate input
> stream in each one, and then calling seek to position each stream at a
> different block boundary.  Seek is a cheap operation, basically just
> updating internal offsets.  Seeking forward does not require reading
> through the earlier data byte-by-byte, so you won't pay the cost of
> transferring that part of the data.
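> A minimal sketch of that approach (the path, thread count, and in-memory
> assembly are illustrative assumptions, not a tested implementation):
>
> import java.util.*;
> import java.util.concurrent.*;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.*;
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
> Path path = new Path("/data/bigfile");  // assumed example path
> FileStatus status = fs.getFileStatus(path);
> long blockSize = status.getBlockSize();
> long len = status.getLen();
>
> ExecutorService pool = Executors.newFixedThreadPool(4);
> List<Future<byte[]>> parts = new ArrayList<>();
> for (long offset = 0; offset < len; offset += blockSize) {
>     final long start = offset;
>     final int count = (int) Math.min(blockSize, len - start);
>     parts.add(pool.submit(() -> {
>         // Each thread opens its own stream and seeks to its block
>         // boundary; seek is cheap and skipped bytes are not transferred.
>         try (FSDataInputStream in = fs.open(path)) {
>             byte[] buf = new byte[count];
>             in.seek(start);
>             in.readFully(buf);
>             return buf;
>         }
>     }));
> }
> pool.shutdown();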
> Whether or not this strategy would really improve performance is subject
> to a lot of other factors.  If the application's single-threaded reading
> already saturates the network bandwidth of the NIC, then starting multiple
> threads is unlikely to improve performance.  Those threads will just run
> into contention with each other on the scarce network bandwidth resources.
> If instead the application reads data gradually and performs some
> CPU-intensive processing as it reads, then perhaps the NIC is not
> saturated, and multi-threading could help.
> As usual with performance work, the actual outcomes are going to be highly
> situational.
> I hope this helps.
> --Chris Nauroth
> From: Tenghuan He <tenghuanhe@gmail.com>
> Date: Thursday, December 31, 2015 at 5:17 PM
> To: Chris Nauroth <cnauroth@hortonworks.com>
> Cc: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Directly reading from datanode using JAVA API got
> socketTimeoutException
> The following is what I want to do.
> When reading a big file that spans multiple blocks, I want to read
> different blocks from different nodes in parallel and thus make reading the
> big file faster.
> Is that possible?
> Thanks
> On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cnauroth@hortonworks.com>
> wrote:
>> Your code has connected to a DataNode's TCP port, and the DataNode server
>> side is likely blocked expecting the client to send some kind of request
>> defined in the Data Transfer Protocol.  The client code here does not write
>> a request, so the DataNode server doesn't know what to do.  Instead, the
>> client immediately goes into a blocking read.  Since the DataNode server
>> side doesn't know what to do, it's never going to write any bytes back to
>> the socket connection, and therefore the client eventually times out on the
>> read.
>> Stepping back, please be aware that what you are trying to do is
>> unsupported.  Relying on private implementation details like this is likely
>> to be brittle and buggy.  As the HDFS code evolves in the future, there is
>> no guarantee that what you do here will work the same way in future
>> versions.  There might not even be a connectToDN method in future versions
>> if we decide to do some internal refactoring.
>> If you can give a high-level description of what you want to achieve,
>> then perhaps we can suggest a way to do it through the public API.
>> --Chris Nauroth
>> From: Tenghuan He <tenghuanhe@gmail.com>
>> Date: Wednesday, December 30, 2015 at 9:29 AM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Directly reading from datanode using JAVA API got
>> socketTimeoutException
>> Hello,
>> I want to read directly from datanode blocks using the Java API, as in the
>> following code, but I got a SocketTimeoutException.
>> I use reflection to call the private DFSClient method connectToDN(...)
>> and get an IOStreamPair of in and out, where in is used to read bytes from
>> the datanode.
>> The workhorse code is
>> try {
>>     Method connectToDN;
>>     Class<?>[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
>>     connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
>>     connectToDN.setAccessible(true);
>>     // The third argument is the LocatedBlock to read from; this line was
>>     // truncated in the original message, so the variable name is assumed.
>>     IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, locatedBlock);
>>     in = new DataInputStream(pair.in);
>>     System.out.println(in.getClass());
>>     byte[] b = new byte[10000];
>>     in.readFully(b);
>> } catch (Exception e) {
>>     e.printStackTrace();
>> }
>> and the exception is
>> java.net.SocketTimeoutException: 11000 millis timeout while waiting for
>> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
>> local=/ remote=/]
>> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>> at java.io.FilterInputStream.read(FilterInputStream.java:133)
>> at java.io.DataInputStream.readFully(DataInputStream.java:195)
>> at java.io.DataInputStream.readFully(DataInputStream.java:169)
>> at BlocksList.main(BlocksList.java:69)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
>> Could anyone tell me where the problem is?
>> Thanks & Regards
>> Tenghuan He
