hadoop-mapreduce-user mailing list archives

From "Connell, Chuck" <Chuck.Conn...@nuance.com>
Subject RE: Extra output files from mapper ?
Date Thu, 12 Jul 2012 14:38:16 GMT
Here is a test case...


The Python code (file_io.py) that I want to run as a map-only job is below. It takes one input
file (not stdin) and creates two output files (not stdout).

#!/usr/bin/env python

import sys

infile = open(sys.argv[1], 'r')
outfile1 = open(sys.argv[2], 'w')
outfile2 = open(sys.argv[3], 'w')

for line in infile:
    sys.stdout.write(line)  # just to verify that infile is being read correctly
    outfile1.write("1. " + line)
    outfile2.write("2. " + line)

# Close explicitly so the output files are flushed to disk.
infile.close()
outfile1.close()
outfile2.close()
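
Run by hand, outside Hadoop, the invocation looks like this (it is the same call the wrapper below makes):

python file_io.py in1.txt out1.txt out2.txt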


But since MapReduce streaming likes to use stdio, I put my job in a Python wrapper (file_io_wrap.py):

#!/usr/bin/env python

import sys
from subprocess import call

# Drain the input stream on stdin; streaming can report a broken pipe
# if the mapper exits without consuming its input.
for line in sys.stdin:
    pass

# Call the real program.
status = call(["python", "file_io.py", "in1.txt", "out1.txt", "out2.txt"])

# Report the result on stdout.
if status == 0:
    sys.stdout.write("Success.")
else:
    sys.stdout.write("Subprocess call failed.")
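
The wrapper can be sanity-checked locally, assuming file_io.py and in1.txt sit in the current directory:

cat /dev/null | python file_io_wrap.py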


Finally, I call the streaming job from this shell script...

#!/bin/bash

# Locate the streaming jar; the glob expands when $STREAM is used unquoted below.
STREAM="hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar"

# Input file should explicitly use hdfs: to avoid confusion with a local file.
# Output dir should not exist.
# The mapper and reducer should explicitly state "python XXX.py" rather than just "XXX.py".
# -files takes a comma-separated list; repeating the flag keeps only the last value.

$STREAM \
-files "hdfs://localhost/tmp/input/in1.txt#in1.txt,hdfs://localhost/tmp/out1.txt#out1.txt,hdfs://localhost/tmp/out2.txt#out2.txt" \
-file file_io_wrap.py \
-file file_io.py \
-input "hdfs://localhost/tmp/input/empty.txt" \
-mapper "python file_io_wrap.py" \
-reducer NONE \
-output /tmp/output20
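
After the job finishes, I inspect the results with the usual commands (paths as in the script above):

hadoop fs -cat /tmp/output20/part-00000
hadoop fs -ls /tmp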


The result is that the whole job runs correctly and the input file is read correctly; I can
see a copy of the input file in part-00000. But the output files (out1.txt and out2.txt) are
nowhere to be found. I suspect they were created somewhere, but where? And how can I control
where they are created?
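
One workaround I have sketched but not tested: have the wrapper itself copy the side files
into HDFS after the subprocess finishes, since the task's working directory seems to be
transient. The target paths under /tmp, and the assumption that the hadoop client is on the
task's PATH, are guesses on my part:

#!/usr/bin/env python

import sys
from subprocess import call

# Drain stdin as before.
for line in sys.stdin:
    pass

status = call(["python", "file_io.py", "in1.txt", "out1.txt", "out2.txt"])

if status == 0:
    # Guesses: the hadoop client is on the task's PATH and /tmp in HDFS is
    # writable. With more than one map task, each task would also need a
    # unique filename (perhaps from the mapred_task_id environment variable
    # that streaming exports) to avoid collisions.
    call(["hadoop", "fs", "-put", "out1.txt", "/tmp/out1.txt"])
    call(["hadoop", "fs", "-put", "out2.txt", "/tmp/out2.txt"])
    sys.stdout.write("Success.")
else:
    sys.stdout.write("Subprocess call failed.")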

Thank you,
Chuck Connell
Nuance R&D Data Team
Burlington, MA



From: Connell, Chuck [mailto:Chuck.Connell@nuance.com]
Sent: Wednesday, July 11, 2012 4:48 PM
To: mapreduce-user@hadoop.apache.org
Subject: Extra output files from mapper ?

I am using MapReduce streaming with Python code. It works fine for basic stdin and stdout.

But I have a mapper-only application that also emits some other output files. So in addition
to stdout, the program also creates files named output1.txt and output2.txt. My code seems
to be running correctly, and I suspect the proper output files are being created somewhere,
but I cannot find them after the job finishes.

I tried using the -files option to create a link to the location where I want the file, but no luck.
I tried using some of the -jobconf options to change the various working directories, but
no luck.

Thank you.

Chuck Connell
Nuance R&D Data Team
Burlington, MA

