hadoop-common-dev mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2991) dfs.du.reserved not honored in 0.15/16 (regression from 0.14+patch for 2549)
Date Wed, 12 Mar 2008 06:08:46 GMT

https://issues.apache.org/jira/browse/HADOOP-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577719#action_12577719

Allen Wittenauer commented on HADOOP-2991:


Ahh file systems.  Can't live with them, can't live with them.

First off: I'm not a big fan of percentages when dealing with file systems.

Back in the day, UFS would reserve 10% of the file system for root's usage.  So on a 10G disk,
it would save 1G for itself.  Not a big deal, and when the file system had issues, that worked
out well.  But a 100G disk would give up 10G.  Ugh.  Not cool.  Go even bigger and the amounts get
insane.  So many implementations changed this to a sliding scale rather than a single percentage.
 Some food for thought.
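One way to read "sliding scale" is percentage-with-a-cap.  This is just a sketch of the idea with made-up numbers, not any particular OS's actual reservation policy:

```python
GiB = 1 << 30

def reserved_bytes(fs_size, pct=0.10, cap=1 * GiB):
    """Hypothetical sliding-scale reservation: take a percentage of the
    file system, but never reserve more than a fixed cap.  Real file
    systems each pick their own curve."""
    return min(int(fs_size * pct), cap)

print(reserved_bytes(10 * GiB) // GiB)    # 10G disk  -> 1 GiB reserved
print(reserved_bytes(100 * GiB) // GiB)   # 100G disk -> still 1 GiB, not 10
```

The flat 10% rule and the capped rule agree on small disks and diverge exactly where the flat rule starts reserving insane amounts.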

Secondly, df.  A great source of cross-platform trouble... Let me throw out one of my favorite
real-world examples, this time from one of my home machines:

Filesystem             size   used  avail capacity  Mounted on
int                    165G    28K    21G     1%    /int
int/home               165G    68G    21G    77%    /export/home
int/mii2u              165G  1014K    21G     1%    /int/mii2u
int/squid-cache        5.0G   4.4G   591M    89%    /int/squid-cache
int/local              165G   289M    21G     2%    /usr/local

Stop.  Go back and look carefully at those numbers.  

In case you haven't guessed, this is a (partial) output of a df -h from a Solaris machine
utilizing ZFS.  It is pretty clear that, with the exception of the file system under a hard
quota (int/squid-cache), size != used + available.  Instead, size = (used across all file systems
in the pool) + available.  Using "used" in any sort of capacity calculation isn't going to tell
you anything about how much space is actually available.  This type of output is fairly common
for any pool-based storage system.
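A toy illustration of the trap, using the (GiB-rounded) numbers from the df output above:

```python
GiB = 1 << 30

# Figures from the ZFS df output above, rounded to whole GiB.
pool_size = 165 * GiB     # "size" reported for every dataset in the pool
home_used = 68 * GiB      # int/home's "used" column
avail = 21 * GiB          # the pool space that is actually available

# Naive per-filesystem arithmetic, as a df-based reserve scheme might do it:
naive_free = pool_size - home_used        # 97 GiB -- wildly optimistic

print(naive_free // GiB, avail // GiB)    # disagree by roughly 76 GiB
```

On a traditional one-device-one-filesystem layout the two numbers would match; on pooled storage, size minus used is fiction.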

Then there are file system quotas, which, depending upon the OS, may or may not show up in
df output.  The same goes for the aforementioned reserved-space percentages.

Anyway, what does this all mean?

Well, in my mind, it means that all of the above suggestions in this JIRA just don't work out
well... and that's just on UNIX.  Heck, even a heterogeneous UNIX environment makes me shudder.
 How does one handle pooled storage *and* traditional file systems if you want to have
a single config?

Quite frankly, you can't.  As much as I hate to say it, I suspect the answer (as unpopular
as it might be) is probably to set a hard limit on how much space HDFS will use rather
than trying to second-guess what the operating system is doing.  Does this suck?  Yes.  Does
it suck less than all of the gymnastics around trying to figure this out dynamically?  I
think so.

Let's face it: to keep an app like Hadoop from eating more space than you want, versus what
is configured in the file system, you are essentially looking at partitioning it.  At
that point, you might as well just configure the limit in the app and be done with it.  In the end,
this basically means that HDFS needs to keep track of how much space it is using at all times
and never go over that limit.  It likely also means implementing high and low water
marks: when the low water mark is hit, writes to the file system get deferred/deprioritized,
and hitting the high water mark means either start rebalancing the blocks or report the file
system as full.
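A minimal sketch of that hard-limit-plus-watermarks idea.  This is hypothetical pseudocode for the policy being proposed, not actual HDFS code; the class name, thresholds, and states are all made up:

```python
GiB = 1 << 30

class SpaceBudget:
    """Hypothetical hard-limit space tracker with high/low water marks.
    Watermarks are fractions of the configured max, with low < high."""

    def __init__(self, max_bytes, low=0.80, high=0.95):
        self.max_bytes = max_bytes
        self.low = int(max_bytes * low)
        self.high = int(max_bytes * high)
        self.used = 0

    def state(self):
        if self.used >= self.high:
            return "FULL"        # rebalance blocks, or report the fs as full
        if self.used >= self.low:
            return "THROTTLE"    # defer/deprioritize incoming writes
        return "OK"

    def write(self, nbytes):
        # Hard limit: never go over the configured max, no matter what
        # the underlying OS claims is free.
        if self.used + nbytes > self.max_bytes:
            raise IOError("over configured HDFS space limit")
        self.used += nbytes
        return self.state()

budget = SpaceBudget(100 * GiB)
print(budget.write(70 * GiB))   # OK        (70 < 80 GiB low mark)
print(budget.write(15 * GiB))   # THROTTLE  (85 >= 80 GiB)
print(budget.write(11 * GiB))   # FULL      (96 >= 95 GiB)
```

The point of the sketch is that the only number the daemon ever consults is its own counter, so df's cross-platform weirdness never enters into it.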

Now, I know that it might be difficult to calculate what the max space should be.  On reflection,
though, I'm not really sure that's true.  If I know what size my slice is and I have an idea
of how much of that I want to give to HDFS, then I can calculate that max value.  If an admin
gets in trouble with the space being allocated, they can lower the high and low water
marks, which should trigger a rebalance and free space.  This is essentially how apps
like squid work, and it works quite well.  [Interestingly enough, squid's file system structure on
disk is quite similar to how the data node stores its blocks.... Hmm... ]
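The arithmetic really is that simple.  A toy version with made-up numbers (the 165G slice and 80% split are illustrative, not a recommendation):

```python
GiB = 1 << 30

slice_size = 165 * GiB           # size of the slice I'm giving HDFS a piece of
hdfs_fraction = 0.80             # how much of it I've decided HDFS may have

# The hard limit to configure -- no df parsing, no OS second-guessing.
max_bytes = int(slice_size * hdfs_fraction)
print(max_bytes // GiB)          # 132
```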

One thing to point out with this solution: if the admin overcommits the space on the drive,
then, quite frankly, they hung themselves.  They know how much space they gave HDFS.  If they
go over it, oh well.  I'd much rather have MapRed blow up than HDFS blow up, since it is much
easier to pick up the pieces of a broken job than of a broken file system, especially in
the case where there are under-replicated blocks.

Again, I totally admit that this solution is likely to be unpopular.  But I can't see a way
out of this mess that works with the multiple types of storage systems in use.

P.S., while I'm here, let me throw more of my own personal prejudices into this: putting
something like Hadoop on / or some other file system (though not necessarily device) that is
used by the OS is just *begging* for trouble.  That's just bad practice for a real, production
system.  Anyone who does that rightly deserves any pain it causes.

> dfs.du.reserved not honored in 0.15/16 (regression from 0.14+patch for 2549)
> ----------------------------------------------------------------------------
>                 Key: HADOOP-2991
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2991
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.15.0, 0.15.1, 0.15.2, 0.15.3, 0.16.0
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
> changes for https://issues.apache.org/jira/browse/HADOOP-1463
> have caused a regression. earlier:
> - we could set dfs.du.reserve to 1G and be *sure* that 1G would not be used.
> now this is no longer true. I am quoting Pete Wyckoff's example:
> <example>
> Let's look at an example. 100 GB disk and /usr using 45 GB and dfs using 50 GBs now
> Df -kh shows:
> Capacity = 100 GB
> Available = 1 GB (remember ~4 GB chopped out for metadata and stuff)
> Used = 95 GBs   
> remaining = 100 GB - 50 GB - 1GB = 49 GB 
> Min(remaining, available) = 1 GB
> 98% of which is usable for DFS apparently - 
> So, we're at the limit, but are free to use 98% of the remaining 1GB.
> </example>
> this is broken. Based on the discussion on 1463, it seems like the notion of 'capacity'
> being the first field of 'df' is problematic. For example, here's what our df output looks like:
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda3             130G  123G   49M 100% /
> as you can see, 'Size' is a misnomer - that much space is not available. Rather, the actual
> usable space is 123G + 49M ~ 123G. (not entirely sure what the discrepancy is due to - but have
> heard this may be due to space reserved for file system metadata). Because of this discrepancy,
> we end up in a situation where the file system is out of space.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
