hadoop-common-dev mailing list archives

From Jim Kellerman <...@powerset.com>
Subject RE: Multiplexing sockets in DFSClient/datanodes?
Date Fri, 14 Mar 2008 20:00:47 GMT
I'm not suggesting doing simultaneous transfers, just having one connection between any one
client and any one data node. My thinking was each transfer would be queued and then processed
one at a time.

This is a big problem for us. On our cluster at Powerset, we have had both datanodes and HBase
region servers run out of file handles because there is one socket open per file.

As HBase installations get larger, one socket per file just won't scale.
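
To make the queued, one-connection-per-datanode idea above concrete, here is a minimal Java sketch. It is not the DFSClient code: the class name QueuedDatanodeChannels, its methods, and the datanode key strings are hypothetical, and a single-threaded executor stands in for the one shared socket to each datanode. Transfers submitted for the same datanode are queued and run one at a time, in order.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Hypothetical sketch: one shared channel per datanode, with transfers
 * queued and processed one at a time. No real sockets are opened here.
 */
final class QueuedDatanodeChannels {
    // Assumption: each single-threaded executor stands in for the one
    // socket this client would keep open to a given datanode.
    private final Map<String, ExecutorService> channels = new ConcurrentHashMap<>();

    /** Queue a transfer; transfers to the same datanode run in FIFO order. */
    <T> Future<T> submit(String datanode, Callable<T> transfer) {
        ExecutorService channel =
            channels.computeIfAbsent(datanode, d -> Executors.newSingleThreadExecutor());
        return channel.submit(transfer);
    }

    /** Number of channels open, i.e. sockets in a real implementation. */
    int openChannels() {
        return channels.size();
    }

    /** Let already-queued transfers finish, then release every channel. */
    void close() {
        for (ExecutorService channel : channels.values()) {
            channel.shutdown();
        }
        channels.clear();
    }
}
```

With this shape, reading 100 files from one datanode would cost one channel rather than 100, at the price of serializing those reads behind each other.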

---
Jim Kellerman, Senior Engineer; Powerset


> -----Original Message-----
> From: dhruba Borthakur [mailto:dhruba@yahoo-inc.com]
> Sent: Friday, March 14, 2008 10:53 AM
> To: core-dev@hadoop.apache.org; hadoop-dev@lucene.apache.org
> Subject: RE: Multiplexing sockets in DFSClient/datanodes?
>
> Hi Jim,
>
> The protocol between the client and the Datanodes will become
> relatively more complex if we decide to multiplex
> simultaneous transfers of multiple blocks on the same socket
> connection. Do you think that the benefit of saving on system
> resources is really appreciable?
>
> Thanks,
> Dhruba
>
> -----Original Message-----
> From: Sanjay Radia [mailto:sradia@yahoo-inc.com]
> Sent: Wednesday, March 12, 2008 11:36 AM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: Multiplexing sockets in DFSClient/datanodes?
>
> Doug Cutting wrote:
> > Jim Kellerman wrote:
> >> Yes, multiplexing a socket is more complicated than having one socket
> >> per file, but saving system resources seems like a way to scale.
> >>
> >> Questions? Comments? Opinions? Flames?
> >
> > Note that Hadoop RPC already multiplexes, sharing a single socket per
> > pair of JVMs.  It would be possible to multiplex datanode, and should
> > not in theory significantly impact performance, but, as you indicate,
> > it would be a significant change.  One approach might be to implement
> > HDFS data access using RPC rather than directly using stream i/o.
> >
> > RPC also tears down idle connections, which HDFS does not.  I wonder
> > how much doing that alone might help your case?  That would probably
> > be much simpler to implement.  Both client and server must already
> > handle connection failures, so it shouldn't be too great of a change
> > to have one or both sides actively close things down if they're idle
> > for more than a few seconds.  This is related to adding write timeouts
> > to the datanode (HADOOP-2346).
>
> Doug,
>    Dhruba and I had discussed using RPC in the past. While
> RPC is a cleaner interface and our RPC implementation has
> features such as connection sharing, closing idle connections,
> etc., streaming IO lets us pipe large amounts of data without
> the request/response exchange.
> The worry was that IO performance would degrade.
> BTW, NFS uses RPC (NFS does not have the write pipeline for replicas).
>
> sanjay
> >
> > Doug
>
