hadoop-user mailing list archives

From Albert Chu <ch...@llnl.gov>
Subject Disk full errors in local-dirs, what data is stored in yarn.nodemanager.local-dirs?
Date Tue, 11 Apr 2017 23:42:41 GMT
Hi,

I have a cluster that uses a parallel networked file system for its
main data storage, and each node has ~750G of local SSD space.  To
speed things up, we configure yarn.nodemanager.local-dirs to use the
local SSD for local caching.

Recently, I've been trying to do a terasort of 2 terabytes of data over
8 nodes w/ Hadoop 2.7.3.  So that's about 6000 gigs of local SSD space
for caching in total, or 5400 gigs once Hadoop applies its 90%
disk-full threshold.
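
To spell out where those two figures come from, here is a minimal
sketch of the arithmetic.  The node count, the 750G per node, and the
90% limit are the numbers quoted above; the function name is only
illustrative and is not part of Hadoop.

```python
def usable_local_space_gb(nodes, ssd_gb_per_node, max_utilization_pct=90.0):
    """Return (raw, usable) local SSD space in GB across the cluster.

    max_utilization_pct is the disk-full limit quoted above (90%).
    """
    raw = nodes * ssd_gb_per_node
    usable = raw * max_utilization_pct / 100.0
    return raw, usable


# 8 nodes with ~750G of local SSD each, as described above.
raw_gb, usable_gb = usable_local_space_gb(8, 750)
print(f"raw: {raw_gb} GB, usable under the 90% limit: {usable_gb} GB")
```

Note that this is a cluster-wide total; the warning in the log is
raised for a single directory on a single node.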

I always get disk-full errors such as the following when running it:

2017-04-11 12:31:44,062 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /l/ssd/achutest/localstore/yarn-nm error, used space above threshold of 90.0%, removing from list of valid directories
2017-04-11 12:31:44,063 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /l/ssd/achutest/localstore/yarn-nm;
2017-04-11 12:31:44,063 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /l/ssd/achutest/localstore/yarn-nm;

What I don't understand is how I am getting disk-full errors.  Within
terasort, I should have at most 2000 gigs of intermediate map output
and at most 2000 gigs of merged data in reducers.  Even assuming some
overhead from Hadoop, I should have more than enough space for this
benchmark to complete, given that maps and reducers are spread out
evenly across nodes.
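
As a rough per-node version of that estimate, here is a minimal
sketch.  It assumes a perfectly even spread across the 8 nodes and
counts only the two 2000-gig figures above, with no other Hadoop
overhead; the function name is hypothetical.

```python
def per_node_budget_gb(total_gb, nodes, ssd_gb_per_node, max_utilization_pct=90.0):
    """Return (expected, limit, headroom) in GB for one node.

    Assumes the data is spread perfectly evenly across nodes.
    """
    expected = total_gb / nodes
    limit = ssd_gb_per_node * max_utilization_pct / 100.0
    return expected, limit, limit - expected


# At most 2000 gigs of map output plus at most 2000 gigs of merged
# reducer data, i.e. the worst case described above.
intermediate_gb = 2000 + 2000
expected, limit, headroom = per_node_budget_gb(intermediate_gb, 8, 750)
print(f"expected {expected} GB per node, limit {limit} GB, headroom {headroom} GB")
```

Under these assumptions each node should stay well below its own 90%
mark, which is why the errors are surprising.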

So my assumption is something else is being cached in local-dirs that
I'm not accounting for.  Is there any other data I should consider when
coming up with my estimates?

One guess I had: is it possible that spilled data from reducer merges
is not deleted until a reducer completes?  If so, in my example above
the total amount of merged data in reducers could exceed 2000 gigs at
some point.

Al

-- 
Albert Chu
chu11@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



