hadoop-common-user mailing list archives

From Sandy <snickerdoodl...@gmail.com>
Subject Re: guaranteeing disk space?
Date Mon, 15 Sep 2008 20:13:20 GMT
I'm not sure if this completely answers your question, but I don't think
there is anything built into Hadoop that automates cleanup between MR
phases. You may have to do it yourself afterwards.

Are you dealing with multiple MapReduce phases? If so, your intermediate
files (which are stored in the intermediate directory of your choice) can
just be deleted afterwards. When I'm running multiple jobs I usually write a
wrapper script that runs one job and removes the intermediate files before
it runs the next one.

#!/bin/bash

hadoop jar myjar.jar myjob input intermediate output
# remove the intermediate output before the next job needs the disk space
hadoop fs -rmr intermediate
hadoop jar myjar.jar myjob input2 intermediate output2


I had disk space errors when I was running my job on a previous machine. The
problem was that the logs were filling up the disk (so I took out some stdout
statements that I had been using for debugging), and then I added the -rmr
step to my scripts.
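
If you want to catch low disk space up front instead of after a failure, you
could also add a rough check to the wrapper before it launches the next job.
This is only a sketch, not anything built into Hadoop: it parses the output
of hadoop dfsadmin -report for the remaining-capacity line (the exact label
varies between Hadoop versions, so check your own output first), and the
50 GB threshold and the old_intermediate path are made-up examples.

#!/bin/bash

MIN_FREE_GB=50   # made-up threshold; adjust to your cluster

# dfsadmin -report prints cluster capacity/usage; grab the first number on
# the "remaining" line (the label differs across releases, e.g.
# "Remaining raw bytes" or "DFS Remaining")
free_bytes=$(hadoop dfsadmin -report | grep -i remaining | head -1 | \
             grep -o '[0-9]\+' | head -1)
free_gb=$((free_bytes / 1024 / 1024 / 1024))

if [ "$free_gb" -lt "$MIN_FREE_GB" ]; then
    echo "Only ${free_gb} GB free in HDFS; cleaning up old intermediate data"
    hadoop fs -rmr old_intermediate    # hypothetical leftover directory
    # or just exit 1 here instead of running the next job
fi

hadoop jar myjar.jar myjob input2 intermediate output2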

But what will definitely work is getting a machine with more disk space.

Good luck,
-SM

On Mon, Sep 15, 2008 at 1:24 PM, Kayla Jay <kaylais30@yahoo.com> wrote:

> How does one check or guarantee that there's enough disk space when running
> a Hadoop job when you're not sure how much it will produce in its results
> (temp files, etc.)?
>
> I.e., when you run a Hadoop job and you're not exactly sure how much disk
> space it will eat up (given temp dirs), the job will fail if it does run
> out.
>
> How do you guarantee, while your job is running, that there's enough disk
> space on the nodes, and kick off cleanup (so the job won't fail) if you're
> running into low disk space?
>
> For example, if your maps are failing because there isn't enough temporary
> disk space on your nodes while you run a job, how can you fix that up front,
> prior to running, or better yet while the job is running, so it doesn't
> cause a failed job? The outputs of maps are stored on the local disk of the
> nodes on which they were executed, and if your nodes don't have enough space
> while running jobs, how can you fix this at run time? Can I catch this
> condition at all?
>
> Is there a way to fix this at run time? How do others solve this issue when
> running jobs where you're not sure how much disk space they will consume?
>
> -----------
> Or, what if you run out of disk space on HDFS when you are running large
> jobs with large outputs? The job just fails... but how can one assess this
> resource allocation of disk space while running your jobs?
>
> If you run out of HDFS disk space, and you know you want the results of job
> X, is there a way to find out while it's running, so you can do some smart
> cleanup and not lose the data that could've been produced by job X?
>
