hadoop-common-dev mailing list archives

From Sanjay Radia <sra...@yahoo-inc.com>
Subject Re: Multiplexing sockets in DFSClient/datanodes?
Date Wed, 12 Mar 2008 20:54:37 GMT
Hairong Kuang wrote:
>> streaming IO lets us pipe large amounts
>> of data without the request/response exchange.
>> The worry was that IO performance would degrade.
>>     
>
> Since HADOOP-2188 removes the IPC timeout, it is OK for a datanode to respond
> to the datanode upstream in the pipeline once it gets a response from the
> datanode downstream. If datanodes could have two threads, one pushing data down
> the pipeline and one writing it to the local disk, using RPC wouldn't introduce
> any additional communication cost.
>   

I believe that is what our pipeline code does.
The client, however, will block for the reply unless we change the client
code to use multiple buffers, etc.
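
To make the multiple-buffers point concrete, here is a minimal sketch in Python (not the actual DFSClient code). It assumes a hypothetical in-process stand-in for a datanode, `LoopbackDatanode`, and a hypothetical helper `write_pipelined`; the sender blocks only when all `window` buffers are awaiting acks.

```python
import queue
import threading


class LoopbackDatanode:
    """Hypothetical stand-in for a datanode: stores each packet, then acks it."""

    def __init__(self):
        self.inbox = queue.Queue()   # packets arriving from the client
        self.acks = queue.Queue()    # acks flowing back to the client
        self.received = []
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            seq, data = self.inbox.get()
            if seq is None:          # sentinel: stop the worker thread
                break
            self.received.append(data)
            self.acks.put(seq)


def write_pipelined(node, packets, window=4):
    """Send packets while keeping at most `window` of them unacknowledged."""
    slots = threading.Semaphore(window)  # one permit per client buffer
    lock = threading.Lock()
    acked = []
    in_flight = 0
    peak = 0

    def collect_acks():
        nonlocal in_flight
        for _ in range(len(packets)):
            seq = node.acks.get()
            with lock:
                in_flight -= 1
            acked.append(seq)
            slots.release()              # the buffer is free again

    collector = threading.Thread(target=collect_acks)
    collector.start()
    for seq, data in enumerate(packets):
        slots.acquire()                  # blocks only when every buffer is in use
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        node.inbox.put((seq, data))
    collector.join()
    node.inbox.put((None, None))         # shut the stand-in datanode down
    return acked, peak
```

With `window=1` this degenerates to the blocking request/response behaviour described above; a larger window lets the client keep sending while earlier replies are still outstanding.
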
> Hairong
>
> On 3/12/08 11:35 AM, "Sanjay Radia" <sradia@yahoo-inc.com> wrote:
>
>   
>> Doug Cutting wrote:
>>     
>>> Jim Kellerman wrote:
>>>       
>>>> Yes, multiplexing a socket is more complicated than having one socket
>>>> per file, but saving system resources seems like a way to scale.
>>>>
>>>> Questions? Comments? Opinions? Flames?
>>>>         
>>> Note that Hadoop RPC already multiplexes, sharing a single socket per
>>> pair of JVMs.  It would be possible to multiplex datanode, and should
>>> not in theory significantly impact performance, but, as you indicate,
>>> it would be a significant change.  One approach might be to implement
>>> HDFS data access using RPC rather than directly using stream i/o.
>>>
>>> RPC also tears down idle connections, which HDFS does not.  I wonder
>>> how much doing that alone might help your case?  That would probably
>>> be much simpler to implement.  Both client and server must already
>>> handle connection failures, so it shouldn't be too great of a change
>>> to have one or both sides actively close things down if they're idle
>>> for more than a few seconds.  This is related to adding write timeouts
>>> to the datanode (HADOOP-2346).
>>>       
>> Doug,
>>    Dhruba and I had discussed using RPC in the past. While RPC is a
>> cleaner interface and our RPC implementation has
>> features such as connection sharing, closing idle connections, etc.,
>> streaming IO lets us pipe large amounts
>> of data without the request/response exchange.
>> The worry was that IO performance would degrade.
>> BTW, NFS uses RPC (NFS does not have the write pipeline for replicas).
>>
>> sanjay
>>     
>>> Doug
>>>       
>
>   

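For the idle-connection teardown Doug mentions in the quoted text, a rough sketch of the idea in Python follows. This is not how Hadoop RPC implements it; `IdleConnectionCache`, `open_conn`, and the 10-second default are all assumptions made for illustration. It relies on the point that both sides already handle connection failures, so a closed connection is simply reopened on the next use.

```python
import time


class IdleConnectionCache:
    """Hypothetical cache that reuses connections and closes idle ones."""

    def __init__(self, open_conn, max_idle_secs=10.0, clock=time.monotonic):
        self._open = open_conn          # callable: address -> connection
        self._max_idle = max_idle_secs
        self._clock = clock             # injectable so tests can fake time
        self._conns = {}                # address -> (connection, last_used)

    def get(self, address):
        """Return the cached connection for address, opening one if needed."""
        entry = self._conns.get(address)
        conn = entry[0] if entry else self._open(address)
        self._conns[address] = (conn, self._clock())
        return conn

    def close_idle(self):
        """Close and forget connections unused for longer than max_idle_secs."""
        now = self._clock()
        closed = []
        for address, (conn, last_used) in list(self._conns.items()):
            if now - last_used > self._max_idle:
                conn.close()
                del self._conns[address]
                closed.append(address)
        return closed
```

A periodic timer on either the client or the datanode could call `close_idle()`; any object with a `close()` method, such as a `socket.socket`, works as the connection.
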
