hadoop-mapreduce-user mailing list archives

From "Berry, Matt" <mwbe...@amazon.com>
Subject Lines missing from output files (
Date Fri, 20 Jul 2012 00:28:37 GMT
I have a slightly modified Text Output Format that essentially writes each key into its own
file. It operates off the premise that my reducer is an identity function and it emits each
record one-by-one in the order they come from the collection. Because the records are emitted
in order from the reducer, I can maintain one open output file and close it when a new key
appears. The reason I am doing it like this instead of using MultipleOutputs is that I am
locked into hadoop 
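
To make the single-open-file idea above concrete, here is a minimal sketch, not the actual output format. It assumes plain java.nio in place of the Hadoop RecordWriter API, assumes keys are valid file names, and uses a hypothetical class name (PerKeyWriter). The file for the current key stays open until a different key arrives; creating a file that already exists raises an IOException (FileAlreadyExistsException).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for the modified output format: one file per key,
// with at most one file open at any time.
class PerKeyWriter implements AutoCloseable {
    private final Path outputDir;
    private String current;      // key whose file is currently open, or null
    private BufferedWriter out;  // writer for the current key's file, or null

    PerKeyWriter(Path outputDir) {
        this.outputDir = outputDir;
    }

    void write(String key, String value) throws IOException {
        if (!key.equals(current)) {
            close(); // a new key appeared: finish the previous key's file
            // CREATE_NEW throws FileAlreadyExistsException if this key's
            // file was already written earlier.
            out = Files.newBufferedWriter(outputDir.resolve(key),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            current = key;
        }
        out.write(value);
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close(); // flushes buffered lines before closing
            out = null;
        }
        current = null;
    }
}
```

With keys arriving grouped (A, A, B) this produces one file per key. If a key comes back after its file was closed (A, B, A), the second attempt to create A fails, which matches the symptom described below.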

The problem I am having is that I am randomly getting IOExceptions due to opening an existing
file. There are two ways I imagine this could happen. (1) Reducer 1 emits a record for key
A and then Reducer 2 emits a record for key A. I'm certain this is not the case, as the keys
should all group together. (2) The records are emitted out of order from a single reducer
(AAAA BBBB A), in which case the reducer would try to open A again.
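
One way to test hypothesis (2) is to check whether any key reappears after its run has ended. The following is a rough diagnostic sketch under stated assumptions: the class and method names (KeyRunChecker, firstReopenedIndex) are hypothetical, keys are non-null strings, and the key sequence seen by one reducer has been captured into a list.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical diagnostic: finds the first record whose key already had a
// completed run earlier in the sequence (e.g. the last A in AAAA BBBB A).
class KeyRunChecker {
    // Returns the index of the first such record, or -1 if every key
    // appears in a single contiguous run.
    static int firstReopenedIndex(List<String> keys) {
        Set<String> finished = new HashSet<>(); // keys whose run has ended
        String current = null;                  // key of the run in progress
        for (int i = 0; i < keys.size(); i++) {
            String key = keys.get(i);
            if (key.equals(current)) {
                continue; // still inside the same run
            }
            if (current != null) {
                finished.add(current); // previous run just ended
            }
            if (finished.contains(key)) {
                return i; // this key's file would be opened a second time
            }
            current = key;
        }
        return -1;
    }
}
```

A result of -1 for every reducer would point away from hypothesis (2) and back toward two writers touching the same key's file.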

What is perplexing me is that in addition to the output files for each key, each output format
opens a log file. I am seeing an exception propagate out from the reducer, but no such error
appears in my log file. Some sample code follows to clarify.

class ModifiedTextOutputFormat {

  public ModifiedTextOutputFormat() {

  protected createOutputFile(name) {
    try {
    } catch (Throwable t) {
      // Here I log the error (although it is missing later)
      logFile.writeBytes("Information about the error");
      closeLogFile(); // Here I close that file to be certain the last line is flushed
      // Here I throw an exception, which appears on stderr
      throw new IOException("Information about the error", t);

  public write(Key k, Value v) {
    if (!k.toString().equals(current)) {
