hadoop-common-user mailing list archives

From Bryan Keller <brya...@gmail.com>
Subject Re: How do I create per-reducer temporary files?
Date Fri, 06 May 2011 17:20:15 GMT
Thanks for the info, Matt.

My use case is this. I have a fairly large amount of data being passed to my reducer. I need
to load all of the data into a matrix and run a linear program. The most efficient way to
do this from the LP side is to write all of the data to a file, and pass the file to the LP.
The file must be a local file; unfortunately, the LP library I'm using does not support
streaming from HDFS.

I want to do the processing in map-reduce, as it allows me to easily partition the problem
into parts and run the LP in parallel. What I am doing now is iterating through the values
in the reducer, writing them to a local file, and then running the LP once the reducer has
received all of its data. I want to use the same local directory that the reducer itself is
using so that, when multiple reducers are running, I can take advantage of all of the drives
I have configured in mapred.local.dir.
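
For reference, here is roughly what the reducer does today, as an untested sketch: the
key/value types are illustrative and runLP() is a stand-in for the call into the LP library.

  import java.io.BufferedWriter;
  import java.io.File;
  import java.io.FileWriter;
  import java.io.IOException;

  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class LPReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      // mapred.child.tmp defaults to "./tmp", relative to the task's working
      // directory, which already sits under one of the mapred.local.dir mounts.
      File tmpDir = new File(context.getConfiguration().get("mapred.child.tmp", "./tmp"));
      File matrixFile = File.createTempFile("lp-matrix-", ".dat", tmpDir);

      // Spill every value for this key into the local file the LP expects.
      BufferedWriter out = new BufferedWriter(new FileWriter(matrixFile));
      try {
        for (DoubleWritable v : values) {
          out.write(Double.toString(v.get()));
          out.newLine();
        }
      } finally {
        out.close();
      }

      // Stand-in for the external LP solver invocation.
      double objective = runLP(matrixFile.getAbsolutePath());
      context.write(key, new DoubleWritable(objective));
      matrixFile.delete();
    }

    private double runLP(String matrixPath) {
      // Call into the LP library here.
      return 0.0;
    }
  }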


On May 4, 2011, at 10:59 PM, Matt Pouttu-Clarke wrote:

> Bryan,
> 
> Not sure you should be concerned with whether the output is on local vs.
> HDFS.  I wouldn't think there would be much of a performance difference if
> you are doing streaming output (append) in both cases.  Hadoop already uses
> local storage wherever possible (including for the task working
> directories as far as I know).  I've never had performance problems with
> side effect files, as long as the correct setup is used.
> 
> Definitely if multiple mounts are available locally where the tasks are
> running you can add a comma delimited list to mapreduce.cluster.local.dir in
> mapred-site.xml of those machines:
> http://hadoop.apache.org/common/docs/current/cluster_setup.html#mapred-site.xml
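> 
> For example, a typical entry might look like this (mount points illustrative;
> on 0.20-era releases the property is named mapred.local.dir):
> 
>   <property>
>     <name>mapreduce.cluster.local.dir</name>
>     <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
>   </property>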
> 
> Theoretically you can use the methods I listed below to create unique
> files/paths under /tmp or any other mount point you wish.  However, it is
> much better to let Hadoop manage where the files are stored (i.e., use the
> work directory given to you).
> 
> If you add multiple paths to mapreduce.cluster.local.dir then Hadoop will
> spread the I/O from multiple mappers/reducers across these paths.  Likewise
> you can mount a RAID 0 (stripe) volume across multiple drives to get the same
> effect.  You can use a single RAID 0 to keep the mapred-site.xml uniform.
> RAID 0 is fine since speculative execution covers you if a disk fails.
> 
> It would be helpful to know your use case since the primary option is
> normally to create multiple outputs from a reducer:
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
> 
> Most likely you should try that before going into the realm of side effect
> files (or messing with local temp on the task nodes).  Try the multiple
> outputs if you are dealing with streaming data.  If you absolutely cannot
> get it to work then you may have to cross check the other more complex
> options.
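> 
> An untested sketch of what that might look like with the new API (the named
> output and the key/value types are illustrative):
> 
>   // In the driver:
>   MultipleOutputs.addNamedOutput(job, "lpinput", TextOutputFormat.class,
>       Text.class, DoubleWritable.class);
> 
>   // In the reducer:
>   private MultipleOutputs<Text, DoubleWritable> mos;
> 
>   protected void setup(Context context) {
>     mos = new MultipleOutputs<Text, DoubleWritable>(context);
>   }
> 
>   protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
>       throws IOException, InterruptedException {
>     for (DoubleWritable v : values) {
>       mos.write("lpinput", key, v);  // lands in files named lpinput-r-nnnnn
>     }
>   }
> 
>   protected void cleanup(Context context) throws IOException, InterruptedException {
>     mos.close();  // flush and close all named outputs
>   }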
> 
> Cheers,
> Matt
> 
> 
> 
> On 5/4/11 1:07 PM, "Bryan Keller" <bryanck@gmail.com> wrote:
> 
>> Am I mistaken or are side-effect files on HDFS? I need my temp files to be on
>> the local filesystem. Also, the Java working directory is not the reducer's
>> local processing directory, thus "./tmp" doesn't get me what I'm after. As it
>> stands now I'm using java.io.tmpdir, which is not a long-term solution for me.
>> I am looking to use the reducer's task-specific local directory which should
>> be balanced across my local drives.
>> 
>> On May 4, 2011, at 12:31 PM, Matt Pouttu-Clarke wrote:
>> 
>>> Hi Bryan,
>>> 
>>> These are called side effect files, and I use them extensively:
>>> 
>>> O'Reilly Hadoop, 2nd Edition, p. 187
>>> Pro Hadoop, p. 279
>>> 
>>> You get the path to save the file(s) using:
>>> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath%28org.apache.hadoop.mapred.JobConf%29
>>> 
>>> The output committer moves these files from the work directory to the output
>>> directory when the task completes.  That way you don't have duplicate files
>>> due to speculative execution.  You should also generate a unique name for
>>> each of your output files by using this function to prevent file name
>>> collisions:
>>> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName%28org.apache.hadoop.mapred.JobConf,%20java.lang.String%29
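>>> 
>>> Roughly, with the old mapred API (untested fragment; assumes the JobConf
>>> you saved from configure()):
>>> 
>>>   // Task attempt's temporary work directory; the output committer promotes
>>>   // its contents to the final output directory when the task succeeds.
>>>   Path workDir = FileOutputFormat.getWorkOutputPath(job);
>>>   String name = FileOutputFormat.getUniqueName(job, "side-effect");
>>>   Path sideFile = new Path(workDir, name);
>>>   FSDataOutputStream out = sideFile.getFileSystem(job).create(sideFile);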
>>> 
>>> Hope this helps,
>>> Matt
>>> 
>>> 
>>> On 5/4/11 12:18 PM, "Bryan Keller" <bryanck@gmail.com> wrote:
>>> 
>>>> Right. What I am struggling with is how to retrieve the path/drive that the
>>>> reducer is using, so I can use the same path for local temp files.
>>>> 
>>>> On May 4, 2011, at 9:03 AM, Robert Evans wrote:
>>>> 
>>>>> Bryan,
>>>>> 
>>>>> I believe that map/reduce gives you a single drive to write to so that your
>>>>> reducer has less of an impact on other reducers/mappers running on the same
>>>>> box.  If you want to write to more drives I thought the idea would then be
>>>>> to increase the number of reducers you have and let mapred assign each to a
>>>>> drive to use, instead of having one reducer eating up I/O bandwidth from all
>>>>> of the drives.
>>>>> 
>>>>> --Bobby Evans
>>>>> 
>>>>> On 5/4/11 7:11 AM, "Bryan Keller" <bryanck@gmail.com> wrote:
>>>>> 
>>>>> I too am looking for the best place to put local temp files I create during
>>>>> reduce processing. I am hoping there is a variable or property someplace that
>>>>> defines a per-reducer temp directory. The "mapred.child.tmp" property is by
>>>>> default simply the relative directory "./tmp", so it isn't useful on its own.
>>>>> 
>>>>> I have 5 drives being used in "mapred.local.dir", and I was hoping to use
>>>>> them all for writing temp files, rather than specifying a single temp
>>>>> directory that all my reducers use.
>>>>> 
>>>>> 
>>>>> On Apr 9, 2011, at 2:40 AM, Harsh J wrote:
>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> On Tue, Apr 5, 2011 at 2:53 AM, W.P. McNeill <billmcn@gmail.com> wrote:
>>>>>>> If I try:
>>>>>>> 
>>>>>>>   storePath = FileOutputFormat.getPathForWorkFile(context, "my-file",
>>>>>>> ".seq");
>>>>>>>   writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
>>>>>>>         configuration, storePath, IntWritable.class, itemClass);
>>>>>>>   ...
>>>>>>>   reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>> storePath, configuration);
>>>>>>> 
>>>>>>> I get an exception about a mismatch in file systems when trying to read
>>>>>>> from the file.
>>>>>>> 
>>>>>>> Alternately if I try:
>>>>>>> 
>>>>>>>   storePath = new Path(SequenceFileOutputFormat.getUniqueFile(context,
>>>>>>> "my-file", ".seq"));
>>>>>>>   writer = SequenceFile.createWriter(FileSystem.get(configuration),
>>>>>>>         configuration, storePath, IntWritable.class, itemClass);
>>>>>>>   ...
>>>>>>>   reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>> storePath, configuration);
>>>>>> 
>>>>>> FileOutputFormat.getPathForWorkFile will give back HDFS paths. And
>>>>>> since you are looking to create local temporary files to be used only
>>>>>> by the task within itself, you shouldn't really worry about unique
>>>>>> filenames (stuff can go wrong).
>>>>>> 
>>>>>> You're looking for the tmp/ directory locally created in the FS where
>>>>>> the Task is running (at ${mapred.child.tmp}, which defaults to ./tmp).
>>>>>> You can create a regular file there using vanilla Java APIs for files,
>>>>>> or using RawLocalFS + your own created Path (not derived via
>>>>>> OutputFormat/etc.).
>>>>>> 
>>>>>>>   storePath = new Path(new Path(context.getConfiguration().get("mapred.child.tmp")),
>>>>>>>       "my-file.seq");
>>>>>>>   writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
>>>>>>>         configuration, storePath, IntWritable.class, itemClass);
>>>>>>>   ...
>>>>>>>   reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>> storePath, configuration);
>>>>>> 
>>>>>> The above should work, I think (haven't tried, but the idea is to use
>>>>>> the mapred.child.tmp).
>>>>>> 
>>>>>> Also see:
>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Directory+Structure
>>>>>> 
>>>>>> --
>>>>>> Harsh J
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 

