hbase-user mailing list archives

From N Keywal <nkey...@gmail.com>
Subject Re: Lowering HDFS socket timeouts
Date Wed, 18 Jul 2012 16:44:53 GMT
I don't know. The question is mainly about the read timeout: you will
connect to the ipc.Client with a read timeout of, let's say, 10s. Server
side, the implementation may do something with another server, with a
connect & read timeout of 60s. So if you have:
HBase --> live DN --> dead DN

The timeout will be triggered in HBase while the live DN is still
waiting for the answer from the dead DN. It could even retry on
another node.
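
If that chain does occur, a rough, purely illustrative timeline (using the 10s
client-side read timeout above and the default 60s on the datanode side) would be:

  t=0s    HBase starts the call against the live DN (10s read timeout)
  t=0s    the live DN in turn waits on the dead DN (still with a 60s timeout)
  t=10s   HBase times out and can retry on another node
  t=60s   the live DN finally gives up on the dead DN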
On paper, this should work, as this could happen in real life without
changing the dfs timeouts. And maybe this case does not even exist.
But as the extension mechanism is designed to add some extra seconds,
it could exist for this reason or something similar. Worth asking on the
hdfs mailing list, I would say.

On Wed, Jul 18, 2012 at 4:28 PM, Bryan Beaudreault
<bbeaudreault@hubspot.com> wrote:
> Thanks for the response, N. I could be wrong here, but since this problem is in the
> HDFS client code, couldn't I set this dfs.socket.timeout in my hbase-site.xml and it would
> only affect hbase connections to hdfs? I.e. we wouldn't have to worry about affecting connections
> between datanodes, etc.
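
For reference, the kind of override being discussed would look roughly like this in
hbase-site.xml (the 10000 ms value is just an illustration of the 5-10 second range
mentioned further down; whether the HDFS client used by HBase actually picks it up is
exactly the open question here):

  <property>
    <name>dfs.socket.timeout</name>
    <value>10000</value>
  </property>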
> --
> Bryan Beaudreault
> On Wednesday, July 18, 2012 at 4:38 AM, N Keywal wrote:
>> Hi Bryan,
>> It's a difficult question, because dfs.socket.timeout is used all over
>> the place in hdfs. I'm currently documenting this.
>> Especially:
>> - It's used for connections between datanodes, and not only for
>> connections between hdfs clients & hdfs datanodes.
>> - It's also used for the two types of datanode connections (ports
>> being 50010 & 50020 by default).
>> - It's used as a connect timeout, but also as a read timeout (the
>> socket is connected, but the other side does not write for a while).
>> - It's used with various extensions, so when you're seeing values like
>> 69000 or 66000 it's often the same setting: timeout + 3s (hardcoded) *
>> #replicas.
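
As a worked example of that extension: with the default 60000 ms timeout,
60000 + 3 * 3000 = 69000 ms and 60000 + 2 * 3000 = 66000 ms, presumably
corresponding to three and two replicas respectively.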
>> For a single datanode issue, with everything going well, it will make
>> the cluster much more reactive: hbase will go to another node
>> immediately instead of waiting. But it will also make it much more
>> sensitive to GC and network issues. If you have a major hardware
>> issue, something like 10% of your cluster going down, this setting
>> will multiply the number of retries and add a lot of workload to
>> your already damaged cluster, and this could make things worse.
>> This said, I think we will need to make it shorter sooner or later, so
>> if you do it on your cluster, it will be helpful...
>> N.
>> On Tue, Jul 17, 2012 at 7:11 PM, Bryan Beaudreault
>> <bbeaudreault@gmail.com> wrote:
>> > Today I needed to restart one of my region servers, and did so without gracefully
>> > shutting down the datanode. For the next 1-2 minutes we had a bunch of failed queries from
>> > various other region servers trying to access that datanode. Looking at the logs, I saw that
>> > they were all socket timeouts after 60000 milliseconds.
>> >
>> > We use HBase mostly as an online datastore, with various APIs powering various
>> > web apps and external consumers. Writes come from the APIs in some cases, but we have
>> > continuous hadoop jobs feeding data in as well.
>> >
>> > Since we have web app consumers, this 60 second timeout seems unreasonably long.
>> > If a datanode goes down, ideally the impact would be much smaller than that. I want to lower
>> > the dfs.socket.timeout to something like 5-10 seconds, but do not know the implications of
>> > doing so.
>> >
>> > In googling I did not find much precedent for this, but I did find some people
>> > talking about upping the timeout to much longer than 60 seconds. Is it generally safe to lower
>> > this timeout dramatically if you want faster failures? Are there any downsides to this?
>> >
>> > Thanks
>> >
>> > --
>> > Bryan Beaudreault
>> >
