hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-788) Datanode behaves badly when one disk is very low on space
Date Sat, 28 Nov 2009 05:06:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783207#action_12783207 ]

Todd Lipcon commented on HDFS-788:
----------------------------------

To add a bit of detail:

You can pretty easily recreate this issue by setting up a 3-node cluster where each DN has
two drives, one large and one very small (say 2GB). Running RandomWriter will then fail due
to lost-pipeline errors. What happens is that, when the small volume has around 100M left,
it will still allow multiple in-flight blocks to write to it. When each of these blocks hits
the end of the volume, it fails. Due to HADOOP-5796, the pipeline retry is sometimes incorrect
and actually includes the same datanode - at that point it will try to *continue* the
half-written block, and of course fail again since the disk is out of space.
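
To make the flawed check concrete, here's a minimal sketch of a round-robin volume chooser
that consults only current free space. This is illustrative code in the shape of
FSDataset.getNextVolume() / FSVolume.getAvailable(), not the actual source, and the names
are approximate:

{code}
import java.util.List;

class RoundRobinVolumeChooser {
  private int curVolume = 0;

  /** Pick the next volume whose *current* free space fits blockSize. */
  synchronized Volume getNextVolume(List<Volume> volumes, long blockSize)
      throws java.io.IOException {
    int startVolume = curVolume;
    while (true) {
      Volume v = volumes.get(curVolume);
      curVolume = (curVolume + 1) % volumes.size();
      // BUG: getAvailable() knows nothing about space already "promised"
      // to other in-flight block writers, so several concurrent
      // allocations can all pass this check on a nearly full volume.
      if (v.getAvailable() >= blockSize) {
        return v;
      }
      if (curVolume == startVolume) {
        throw new java.io.IOException("Out of space");
      }
    }
  }
}

interface Volume {
  long getAvailable();  // free bytes, minus dfs.datanode.du.reserved
}
{code}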

This bug also accounts for confusing behavior with regard to dfs.datanode.du.reserved - if
you reserve 1GB, for example, you can sometimes overshoot that reservation and end up with
only 600M free. The amount of overshoot depends on the average number of concurrent writers
to a given disk. With client applications like HBase that write commit logs slowly over time,
or slow MR jobs on clusters with lots of slots, this can be reasonably high. A large block
size exacerbates the issue.
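
Back-of-the-envelope, with hypothetical numbers picked to roughly match the example above
(a 1GB reserve, 128M blocks, 3 concurrent writers):

{code}
public class OvershootEstimate {
  public static void main(String[] args) {
    long reserved  = 1024L << 20;  // dfs.datanode.du.reserved = 1GB
    long blockSize = 128L << 20;   // block size = 128M
    int concurrentWriters = 3;     // in-flight writers on the volume

    // Worst case: each writer passed the free-space check just before the
    // volume dipped into the reserve, then went on to write a full block.
    long overshoot = concurrentWriters * blockSize;  // 384M past the reserve
    long freeLeft  = reserved - overshoot;           // ~640M of the 1GB left
    System.out.println("overshoot=" + (overshoot >> 20) + "M, "
        + "reserve remaining=" + (freeLeft >> 20) + "M");
  }
}
{code}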

The solution is to use the already-existing list of in-progress writes to track how much space
has been "promised" (the writers already pass their block size in the request for a volume).
The "promise" value should be subtracted from the available space when the round-robin policy
picks a volume. This ends up being conservative, since it double-counts a block that has
written most of its data but not yet been committed. In my opinion conservative is OK - I'd
rather have it save a bit more space than absolutely necessary than fail the writes. To be
absolutely correct, we'd have to track the amount of space used by each of the in-progress
writers and subtract that from the "promise" during the accounting - probably not too difficult,
but a slightly more involved change.
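
A sketch of what that accounting could look like (names are assumed, and this is not a patch
- the real change would live in FSVolume/FSDataset):

{code}
class PromisingVolume {
  private final long capacityBytes;
  private long usedBytes = 0;
  private long promisedBytes = 0;  // sum of block sizes of in-flight writes

  PromisingVolume(long capacityBytes) { this.capacityBytes = capacityBytes; }

  /** What the round-robin chooser should see: free space minus promises. */
  synchronized long getAvailable() {
    return Math.max(0, capacityBytes - usedBytes - promisedBytes);
  }

  /** Writer granted this volume: promise its full block size up front. */
  synchronized void promise(long blockSize) { promisedBytes += blockSize; }

  /** Block finalized (or aborted): convert the promise into real usage. */
  synchronized void finalizeBlock(long blockSize, long bytesActuallyWritten) {
    promisedBytes -= blockSize;
    usedBytes += bytesActuallyWritten;
  }
}
{code}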

> Datanode behaves badly when one disk is very low on space
> ---------------------------------------------------------
>
>                 Key: HDFS-788
>                 URL: https://issues.apache.org/jira/browse/HDFS-788
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>            Reporter: Todd Lipcon
>
> FSDataset.getNextVolume() uses FSVolume.getAvailable() to determine whether to allocate
> a block on a volume. This doesn't factor in other in-flight blocks that have been "promised"
> space on the volume. The resulting issue is that, if a volume is nearly full but not full,
> multiple blocks will be allocated on that volume, and then they will all hit "Out of space"
> errors during the write.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

