hbase-user mailing list archives

From Bryan Beaudreault <bbeaudrea...@hubspot.com>
Subject Re: Lowering HDFS socket timeouts
Date Wed, 18 Jul 2012 14:28:23 GMT
Thanks for the response, N.  I could be wrong here, but since this problem is in the HDFS client
code, couldn't I set dfs.socket.timeout in my hbase-site.xml so that it only affects
HBase's connections to HDFS?  That way we wouldn't have to worry about affecting connections
between datanodes, etc.
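
For what it's worth, the change I have in mind is just something along these lines in
hbase-site.xml (10000 ms is only an example taken from the 5-10 second range in my
original mail below, not a recommendation):

  <property>
    <name>dfs.socket.timeout</name>
    <value>10000</value>
    <description>Socket timeout in ms used by the HDFS client inside the
      region server; example value only.</description>
  </property>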

-- 
Bryan Beaudreault


On Wednesday, July 18, 2012 at 4:38 AM, N Keywal wrote:

> Hi Bryan,
> 
> It's a difficult question, because dfs.socket.timeout is used all over
> the place in HDFS. I'm currently documenting this.
> Especially:
> - it's used for connections between datanodes, and not only for
> connections between HDFS clients & HDFS datanodes.
> - it's also used for both types of datanode connections (ports
> 50010 & 50020 by default).
> - it's used as a connect timeout, but also as a read timeout (the
> socket is connected, but the other end does not write for a while).
> - it's used with various extensions, so when you're seeing values like
> 69000 or 66000 it's often the same setting plus a hardcoded 3s per
> replica (e.g. 60000 + 3 * 3000 = 69000 with three replicas).
> 
> For a single datanode failure, with everything else going well, it will
> make the cluster much more reactive: HBase will go to another node
> immediately instead of waiting. But it will also make it much more
> sensitive to GC and network issues. If you have a major hardware
> issue, something like 10% of your cluster going down, this setting
> will multiply the number of retries and add a lot of workload to
> your already damaged cluster, which could make things worse.
> 
> That said, I think we will need to make it shorter sooner or later, so
> if you do it on your cluster, the feedback will be helpful...
> 
> N.
> 
> On Tue, Jul 17, 2012 at 7:11 PM, Bryan Beaudreault
> <bbeaudreault@gmail.com> wrote:
> > Today I needed to restart one of my region servers, and did so without
> > gracefully shutting down the datanode. For the next 1-2 minutes we had a
> > bunch of failed queries from various other region servers trying to access
> > that datanode. Looking at the logs, I saw that they were all socket
> > timeouts after 60000 milliseconds.
> > 
> > We use HBase mostly as an online datastore, with various APIs powering
> > various web apps and external consumers. Writes come from the API in some
> > cases, but we also have continuous Hadoop jobs feeding data in.
> > 
> > Since we have web app consumers, this 60 second timeout seems unreasonably
> > long. If a datanode goes down, ideally the impact would be much smaller
> > than that. I want to lower the dfs.socket.timeout to something like 5-10
> > seconds, but do not know the implications of this.
> > 
> > In googling I did not find much precedent for this, but I did find some
> > people talking about upping the timeout to much longer than 60 seconds. Is
> > it generally safe to lower this timeout dramatically if you want faster
> > failures? Are there any downsides to this?
> > 
> > Thanks
> > 
> > --
> > Bryan Beaudreault
> > 
> 
> 
> 


