hbase-dev mailing list archives

From N Keywal <nkey...@gmail.com>
Subject Re: hbase mttr vs. hdfs
Date Mon, 16 Jul 2012 17:08:35 GMT
And to continue on this, for the files still opened (i.e. our wal
files), we've got two calls to the dead DN:

- one, during the input stream opening, from DFSClient#updateBlockInfo.
This call fails, but the exception is swallowed without being logged.
The node info is not updated, but there is no error, so we continue
without the right info. The timeout will be 60 seconds. This call is
on port 50020.
- the second is the one already mentioned for the data transfer,
with the timeout of 69 seconds. The dead nodes list is not updated by
the first failure, leading to a total wait time of more than 2 minutes if we get
directed to the bad location.
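To make the arithmetic concrete, here is a sketch of the worst-case wait described above. The constants mirror the figures quoted in this thread (60s for the updateBlockInfo call, 3s hardcoded per replica plus "dfs.socket.timeout" for the transfer); the class and method names are invented for the example.

```java
// Illustrative arithmetic only: constants are the figures quoted in the
// discussion, not values read from an actual HDFS configuration.
public class RecoveryWaitEstimate {
    // First call (DFSClient#updateBlockInfo, port 50020): plain 60s timeout.
    static final int UPDATE_BLOCK_INFO_TIMEOUT_S = 60;

    // Second call (block transfer): 3s hardcoded per replica + dfs.socket.timeout.
    static int transferTimeoutSeconds(int numberOfReplicas, int dfsSocketTimeoutS) {
        return 3 * numberOfReplicas + dfsSocketTimeoutS;
    }

    // Worst-case wait when both calls are directed to the dead datanode.
    static int worstCaseSeconds(int numberOfReplicas, int dfsSocketTimeoutS) {
        return UPDATE_BLOCK_INFO_TIMEOUT_S
                + transferTimeoutSeconds(numberOfReplicas, dfsSocketTimeoutS);
    }
}
```

With the defaults (3 replicas, 60s socket timeout) this gives 60 + 69 = 129 seconds, i.e. the >2 minutes mentioned above.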

On Mon, Jul 16, 2012 at 2:00 PM, N Keywal <nkeywal@gmail.com> wrote:
> I found another solution, better than the workaround I was previously
> mentioning, that could be implemented in the DFS client or the
> namenode:
> The NN returns an ordered set of DNs. We could change this ordering. For
> an hlog file, if there is a DN on the same node as the dead RS, this
> DN would get the lowest priority. HBase would just need the file name
> of the block to make this decision.
> Advantages are:
> - this part is already centralized in hdfs namenode. To do it cleanly
> it requires publishing racks & node distribution in an interface; but
> I hope it's possible if not already done.
> - it can be also put in the DFSClient, and this solves the issues
> mentioned by Andrew: the customization would be for HBase only, and
> would not impact other applications sharing the cluster.
> - The client already modifies the nodes list returned by the NN, so
> we're not adding much responsibility here.
> - We just change the order of the blocks; nothing else changes in
> hdfs. Nevertheless, the dead node will be tried only if all the other
> nodes failed as well, so it solves the issue for the block transfer (I
> still need to look at the ipc.Client stuff, but that's another point
> hopefully)...
> Issues:
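A minimal sketch of the proposed reordering, using plain hostnames instead of HDFS's real DatanodeInfo type to keep the example small (names are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed ordering change. The real code would live in the
// NN or the DFSClient and work on DatanodeInfo[], not hostnames.
public class WalReplicaOrdering {
    // Return the replicas with any datanode living on the dead region
    // server's host moved to the end; everything else keeps the NN order.
    static List<String> deprioritizeDeadHost(List<String> nnOrder, String deadRsHost) {
        List<String> preferred = new ArrayList<>();
        List<String> lastResort = new ArrayList<>();
        for (String dn : nnOrder) {
            if (dn.equals(deadRsHost)) {
                lastResort.add(dn); // likely-dead colocated replica
            } else {
                preferred.add(dn);
            }
        }
        preferred.addAll(lastResort); // still tried if all others fail
        return preferred;
    }
}
```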
> - it requires a change in hdfs...
> Do you see any other issue with this approach?
> Cheers,
> N.
> On Fri, Jul 13, 2012 at 7:46 PM, N Keywal <nkeywal@gmail.com> wrote:
>> From a performance point of view, I think it could be manageable.
>> If I put it to an extreme, today we're writing to 3 locations, with
>> local one being often useless. If we write only to the 2 remote
>> locations, we have the same reliability, without the issue of using
>> the dead node when we read for recovery.
>> And when we write to 2 remote locations today, we write to one which
>> is on a remote rack. So if tomorrow we write to 3 remote locations, 2
>> on the same rack and one on another:
>> - we don't add disk i/o to the cluster: still 3 blocks written in the cluster.
>> - the added latency should be low compared to the existing ones as
>> it's on the same rack.
>> - we're adding some network i/o, but on the same rack.
>> - as it's an append, we're on the same network socket, the connection
>> cost is not important.
>> So we're getting a reliability boost at a very reasonable price, I
>> think (it's always cheap on paper :-). And, for someone needing better
>> write perfs, having only two replicas is not unreasonable compared to
>> what we have today in terms of reliability...
>> I'm trying to find something different that could be made available
>> sooner without deployment issues. Maybe there is a hack possible
>> around DFSClient#reportBadBlocks, but there are some side effects as
>> well...
>> On Fri, Jul 13, 2012 at 7:11 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>>> In that case, though, we'd slow down normal operation.
>>> Maybe that can be alleviated with HDFS-1783/HBASE-6116, although as mentioned
>>> in HBASE-6116, I have not been able to measure any performance improvement from this so far.
>>> -- Lars
>>> ________________________________
>>>  From: N Keywal <nkeywal@gmail.com>
>>> To: dev@hbase.apache.org
>>> Sent: Friday, July 13, 2012 6:27 AM
>>> Subject: Re: hbase mttr vs. hdfs
>>> I looked at this part of hdfs code, and
>>> - it's not simple to add it in a clean way, even if doing it is possible.
>>> - I was wrong about the 3s heartbeat: the heartbeat is every 5 minutes
>>> actually. So changing this would not be without a lot of side effects.
>>> - as a side note HADOOP-8144 is interesting...
>>> So not writing the WAL on the local machine could be a good medium-term
>>> option, that could likely be implemented with HDFS-385 (made
>>> available recently in "branch-1"; I don't know what it stands for).
>>> On Fri, Jul 13, 2012 at 9:53 AM, N Keywal <nkeywal@gmail.com> wrote:
>>>> Another option would be to never write the wal locally: in nearly all
>>>> cases it won't be used as it's on the dead box. And then the recovery
>>>> would not be directed by the NN to a dead DN in a single box failure. And
>>>> we would have 3 copies instead of 2, increasing global reliability...
>>>> On Fri, Jul 13, 2012 at 12:16 AM, N Keywal <nkeywal@gmail.com> wrote:
>>>>> Hi Todd,
>>>>> Do you think the change would be too intrusive for hdfs? I agree,
>>>>> there are many less critical components in hadoop :-). I was hoping
>>>>> that this state could be internal to the NN and could remain localized
>>>>> without any interface change...
>>>>> Your proposal would help for sure. I see 3 points if we try to do it
>>>>> for specific functions like recovery.
>>>>>  - we would then need to manage the case when all 3 nodes time out
>>>>> after 1s, hoping that two of them are false positives...
>>>>>  - the writes between DNs would still use the old timeout. I didn't
>>>>> look in detail at the impact. It won't be an issue for a single box
>>>>> crash, but for a large failure it could be.
>>>>>  - we would want to change it for the ipc.Client as well. Not sure
>>>>> the change would not be visible to all functions.
>>>>> What worries me about setting very low timeouts is that it's difficult
>>>>> to validate, it tends to work until it goes to production...
>>>>> I was also thinking of making the deadNodes list public in the client,
>>>>> so hbase could tell the DFSClient: 'this node is dead, I know it
>>>>> because I'm recovering the RS', but it would have some false positives
>>>>> (software region server crash), and seems a little like a
>>>>> workaround...
>>>>> In the middle (thinking again about your proposal), we could add a
>>>>> function in hbase that would first check the DNs owning the WAL,
>>>>> trying to connect with a 1s timeout, to be able to tell the DFSClient
>>>>> who's dead.
>>>>> Or we could put this function in DFSClient, a kind of boolean to say
>>>>> fail fast on dn errors for this read...
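The "check the DNs owning the WAL first with a 1s timeout" idea could look like the sketch below. The class and method are hypothetical helpers, not an existing HBase or HDFS API; the port would be the DN's data transfer port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: probe a datanode with a short connect timeout before
// recovery, so HBase could tell the DFSClient which replicas look dead.
public class DatanodeProbe {
    static boolean looksAlive(String host, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // connection refused or timed out: possibly dead
        }
    }
}
```

Note this has the same false-positive risk mentioned above: a slow but live DN would be flagged, which is why the result should only deprioritize a node, not ban it.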
>>>>> On Thu, Jul 12, 2012 at 11:24 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>>> Hey Nicolas,
>>>>>> Another idea that might be able to help this without adding an entire
>>>>>> new state to the protocol would be to just improve the HDFS client
>>>>>> side in a few ways:
>>>>>> 1) change the "deadnodes" cache to be a per-DFSClient structure
>>>>>> instead of per-stream. So, after reading one block, we'd note that the
>>>>>> DN was dead, and de-prioritize it on future reads. Of course we'd need
>>>>>> to be able to re-try eventually since dead nodes do eventually
>>>>>> restart.
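A rough sketch of what such a per-client cache with expiry might look like (the class and method names are invented for the example; the real DFSClient structure would differ):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of point 1: a dead-node list shared by all streams of one
// DFSClient, with an expiry so a restarted datanode gets retried.
public class ClientDeadNodeCache {
    private final Map<String, Long> deadUntil = new HashMap<>();
    private final long retryAfterMillis;

    ClientDeadNodeCache(long retryAfterMillis) {
        this.retryAfterMillis = retryAfterMillis;
    }

    synchronized void markDead(String datanode, long nowMillis) {
        deadUntil.put(datanode, nowMillis + retryAfterMillis);
    }

    // "Suspect" means de-prioritized, not banned: callers should still fall
    // back to these nodes if every other replica fails.
    synchronized boolean isSuspect(String datanode, long nowMillis) {
        Long until = deadUntil.get(datanode);
        if (until == null) return false;
        if (nowMillis >= until) {        // expired: give the node another chance
            deadUntil.remove(datanode);
            return false;
        }
        return true;
    }
}
```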
>>>>>> 2) when connecting to a DN, if the connection hasn't succeeded within
>>>>>> 1-2 seconds, start making a connection to another replica. If the
>>>>>> other replica succeeds first, then drop the connection to the first
>>>>>> (slow) node.
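Point 2 is essentially a hedged request. A sketch, under the assumption that a replica connection attempt can be modelled as a Callable (real code would race actual socket connects):

```java
import java.util.concurrent.*;

// Sketch of point 2: if the first attempt hasn't succeeded within a short
// delay, start a second one and keep whichever finishes first.
public class HedgedConnect {
    static <T> T firstOf(Callable<T> primary, Callable<T> backup, long hedgeDelayMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletionService<T> cs = new ExecutorCompletionService<>(pool);
            cs.submit(primary);
            Future<T> first = cs.poll(hedgeDelayMillis, TimeUnit.MILLISECONDS);
            if (first != null) {
                return first.get();   // primary answered within the delay
            }
            cs.submit(backup);        // hedge: race the other replica
            return cs.take().get();   // first to finish wins
        } finally {
            pool.shutdownNow();       // drop the slow (losing) attempt
        }
    }
}
```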
>>>>>> Wouldn't this solve the problem less invasively?
>>>>>> -Todd
>>>>>> On Thu, Jul 12, 2012 at 2:20 PM, N Keywal <nkeywal@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>> I have looked at the HBase MTTR scenario when we lose a full box with
>>>>>>> its datanode and its hbase region server altogether: it means a RS
>>>>>>> recovery, hence reading the log files and writing new ones (splitting
>>>>>>> logs).
>>>>>>> By default, HDFS considers a DN as dead when there has been no heartbeat for
>>>>>>> 10:30 minutes. Until this point, the NameNode will consider it
>>>>>>> perfectly valid and it will get involved in all read & write
>>>>>>> operations.
>>>>>>> And, as we lost a RegionServer, the recovery process will take place,
>>>>>>> so we will read the WAL & write new log files. And with the RS, we
>>>>>>> lost the replica of the WAL that was with the DN of the dead box. In
>>>>>>> other words, 33% of the DNs we need are dead. So, to read the WAL, per
>>>>>>> block to read and per reader, we've got one chance out of 3 to go to
>>>>>>> the dead DN, and to get a connect or read timeout issue. With a
>>>>>>> reasonable cluster and a distributed log split, we will have a sure
>>>>>>> winner.
>>>>>>> I looked in detail at the hdfs configuration parameters and their
>>>>>>> impacts. We have the calculated values:
>>>>>>> heartbeat.interval = 3s ("dfs.heartbeat.interval").
>>>>>>> heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
>>>>>>> heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10:30 minutes
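As a sanity check on the figures above, the expiry formula written out (formula and config key names per this thread; the helper class is invented for the example):

```java
// heartbeatExpireInterval = 2 * heartbeat.recheck.interval
//                         + 10 * dfs.heartbeat.interval
public class DeadNodeDelay {
    static long heartbeatExpireIntervalSeconds(long recheckIntervalS, long heartbeatIntervalS) {
        return 2 * recheckIntervalS + 10 * heartbeatIntervalS;
    }
}
```

With the defaults (300s recheck, 3s heartbeat) this is 630 seconds, i.e. 10:30 minutes.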
>>>>>>> At least on 1.0.3, there is no shutdown hook to tell the NN to
>>>>>>> consider this DN as dead, for example on a software crash.
>>>>>>> So before the 10:30 minutes, the DN is considered as fully available
>>>>>>> by the NN.  After this delay, HDFS is likely to start replicating the
>>>>>>> blocks contained on the dead node to get back to the right number of
>>>>>>> replicas. As a consequence, if we're too aggressive we will have a side
>>>>>>> effect here, adding workload to an already damaged cluster. According
>>>>>>> to Stack: "even with this 10 minutes wait, the issue was met in real
>>>>>>> production cases in the past, and the latency increased badly". Maybe
>>>>>>> there is some tuning to do here, but going under these 10 minutes does
>>>>>>> not seem to be an easy path.
>>>>>>> For the clients, they don't fully rely on the NN feedback, and they
>>>>>>> keep, per stream, a dead node list. So for a single file, a given
>>>>>>> client will make the error once, but if there are multiple files it will
>>>>>>> go back to the wrong DN. The settings are:
>>>>>>> connect/read:  (3s (hardcoded) * NumberOfReplica) + 60s ("dfs.socket.timeout")
>>>>>>> write: (5s (hardcoded) * NumberOfReplica) + 480s
>>>>>>> ("dfs.datanode.socket.write.timeout")
>>>>>>> That gives a 69s timeout to get a "connect" error with the default config.
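The two formulas above, written out (the hardcoded 3s/5s per-replica constants and the config defaults are as quoted in this thread; the class is invented for the example):

```java
// Client-side timeouts as described above: a hardcoded per-replica setup
// time plus the configured socket timeout.
public class ClientTimeouts {
    // connect/read: 3s (hardcoded) * replicas + "dfs.socket.timeout"
    static int connectReadTimeoutSeconds(int replicas, int dfsSocketTimeoutS) {
        return 3 * replicas + dfsSocketTimeoutS;
    }
    // write: 5s (hardcoded) * replicas + "dfs.datanode.socket.write.timeout"
    static int writeTimeoutSeconds(int replicas, int writeTimeoutS) {
        return 5 * replicas + writeTimeoutS;
    }
}
```

With 3 replicas and the defaults, this gives 69s for connect/read and 495s for write.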
>>>>>>> I also had a look at larger failure scenarios, when we're losing
>>>>>>> 20% of a cluster. The smaller the cluster is, the easier it is to get
>>>>>>> there. With the distributed log split, we're actually in a better
>>>>>>> shape from an hdfs point of view: the master could have errors on
>>>>>>> the files, because it could hit a dead DN 3 times in a row. If the
>>>>>>> split is done by the RS, this issue disappears. We will however get a
>>>>>>> lot of errors between the nodes.
>>>>>>> Finally, I had a look at the lease stuff. Lease: write access lock on a
>>>>>>> file; no other client can write to the file. But another client can
>>>>>>> read it. Soft lease limit: another client can preempt the lease.
>>>>>>> Configurable. Default: 1 minute.
>>>>>>> Hard lease limit: hdfs closes the file and frees the resources on
>>>>>>> behalf of the initial writer. Default: 60 minutes.
>>>>>>> => This should not impact HBase, as it does not prevent the recovery
>>>>>>> process from reading the WAL or writing new files. We just need writes to
>>>>>>> be immediately available to readers, and it's possible thanks to
>>>>>>> HDFS-200. So if a RS dies we should have no waits even if the lease
>>>>>>> was not freed. This seems to be confirmed by tests.
>>>>>>> => It's interesting to note that this setting is much more aggressive
>>>>>>> than the one to declare a DN dead (1 minute vs. 10 minutes). Or, in
>>>>>>> HBase, than the default ZK timeout (3 minutes).
>>>>>>> => This said, HDFS states this: "When reading a file open for writing,
>>>>>>> the length of the last block still being written is unknown
>>>>>>> to the NameNode. In this case, the client asks one of the replicas for
>>>>>>> the latest length before starting to read its content." This leads to
>>>>>>> an extra call to get the file length on the recovery (likely with the
>>>>>>> ipc.Client), and we may once again go to the wrong dead DN. In that
>>>>>>> case we have an extra socket timeout to consider.
>>>>>>> On paper, it would be great to set "dfs.socket.timeout" to a lower
>>>>>>> value during a log split, as we know we will get a dead DN 33% of the
>>>>>>> time. It may be more complicated in real life as the connections are
>>>>>>> shared per process. And we could still have the issue with the
>>>>>>> ipc.Client.
>>>>>>> As a conclusion, I think it could be interesting to have a third
>>>>>>> status for DNs in HDFS: between live and dead as today, we could add
>>>>>>> "sick". We would have:
>>>>>>> 1) Dead, known as such => As today: start to replicate the blocks to
>>>>>>> other nodes. You enter this state after 10 minutes. We could even wait
>>>>>>> more.
>>>>>>> 2) Likely to be dead: don't propose it for write blocks, put it with a
>>>>>>> lower priority for read blocks. We would enter this state in two
>>>>>>> conditions:
>>>>>>>   2.1) No heartbeat for 30 seconds (configurable of course). As there
>>>>>>> is an existing heartbeat of 3 seconds, we could even be more
>>>>>>> aggressive here.
>>>>>>>   2.2) We could have a shutdown hook in hdfs such that when a DN stops
>>>>>>> 'properly' it says so to the NN, and the NN can put it in this 'half
>>>>>>> state'.
>>>>>>>   => In all cases, the node stays in the second state until the 10:30
>>>>>>> timeout is reached or until a heartbeat is received.
>>>>>>>  3) Live.
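A sketch of how the proposed three-state classification might look. Nothing like this exists in HDFS today; the state names and thresholds follow the proposal above, and everything else is invented for the example.

```java
// Hypothetical third datanode state, classified from heartbeat age.
public class DatanodeLiveness {
    enum State { LIVE, SICK, DEAD }

    // sickAfterS ~ 30s (proposed), deadAfterS ~ 630s (today's expiry).
    static State classify(long secondsSinceHeartbeat, long sickAfterS, long deadAfterS) {
        if (secondsSinceHeartbeat >= deadAfterS) {
            return State.DEAD;  // as today: start re-replicating its blocks
        }
        if (secondsSinceHeartbeat >= sickAfterS) {
            return State.SICK;  // skip for writes, deprioritize for reads
        }
        return State.LIVE;
    }
}
```

A SICK node goes back to LIVE as soon as a heartbeat arrives, which in this sketch simply means secondsSinceHeartbeat resets to zero.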
>>>>>>>  For HBase it would make life much simpler, I think:
>>>>>>>  - no 69s timeout on the mttr path
>>>>>>>  - fewer connections to dead nodes leading to resources held all over
>>>>>>> the place, finishing with a timeout...
>>>>>>>  - and there is already a very aggressive 3s heartbeat, so we would
>>>>>>> not add any workload.
>>>>>>>  Thoughts?
>>>>>>>  Nicolas
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
