hadoop-common-user mailing list archives

From Meng Mao <meng...@gmail.com>
Subject Re: ways to expand hadoop.tmp.dir capacity?
Date Wed, 05 Oct 2011 05:44:09 GMT
I just read this:

MapReduce performance can also be improved by distributing the temporary
data generated by MapReduce tasks across multiple disks on each machine.

Given that the default value is ${hadoop.tmp.dir}/mapred/local, could the
expanded capacity we're looking for be achieved simply by defining
mapred.local.dir to span multiple disks? (Setting aside the issue of temp
files so big that they could still fill a whole disk.)
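If that reading is right, the change might look like the following mapred-site.xml fragment. This is only a sketch: the /mnt/disk* mount points are hypothetical placeholders, not paths from our cluster.

```xml
<!-- mapred-site.xml: spread task-local intermediate data across several disks.
     mapred.local.dir accepts a comma-separated list of directories;
     the mount points below are hypothetical placeholders. -->
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/disk1/mapred/local,/mnt/disk2/mapred/local,/mnt/disk3/mapred/local</value>
</property>
```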

On Wed, Oct 5, 2011 at 1:32 AM, Meng Mao <mengmao@gmail.com> wrote:

> Currently, we've got defined:
>   <property>
>      <name>hadoop.tmp.dir</name>
>      <value>/hadoop/hadoop-metadata/cache/</value>
>   </property>
> In our experiments with SOLR, the intermediate files are so large that they
> tend to blow out disk space and fail (and annoyingly leave behind their huge
> failed attempts). We've had issues with it in the past, but we're having
> real problems with SOLR if we can't comfortably get more space out of
> hadoop.tmp.dir somehow.
> 1) It seems we never set *mapred.system.dir* to anything special, so it's
> defaulting to ${hadoop.tmp.dir}/mapred/system.
> Is this a problem? The docs seem to recommend against it when
> hadoop.tmp.dir has ${user.name} in it, which ours doesn't.
> 1b) The doc says mapred.system.dir is "the in-HDFS path to shared MapReduce
> system files." To me, that means there must be a single path for
> mapred.system.dir, which effectively forces hadoop.tmp.dir to be a single path.
> Otherwise, one might imagine that you could specify multiple paths to store
> hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
> interpretation -- that hadoop.tmp.dir could live on multiple paths/disks if
> there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?
> 2) IIRC, there's a -D switch for supplying config name/value pairs to
> individual jobs. Does such a switch exist? Googling for single letters is
> fruitless. If we had a path on our workers with more space (in our case,
> another hard disk), could we simply pass that path in as hadoop.tmp.dir for
> our SOLR jobs? Without incurring any consistency issues on future jobs that
> might use the SOLR output on HDFS?
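
(On point 2: Hadoop's GenericOptionsParser does accept -D name=value pairs, provided the job's main class runs through ToolRunner. A sketch of an invocation, where the jar name, class name, and paths are made-up placeholders:)

```shell
# Hypothetical invocation -- jar, class, and paths are placeholders.
# The -D overrides are picked up by GenericOptionsParser, so they only
# take effect if the job's main class is run via ToolRunner.
hadoop jar solr-index.jar com.example.SolrIndexJob \
    -D mapred.local.dir=/mnt/disk2/mapred/local,/mnt/disk3/mapred/local \
    input_path output_path
```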
