hadoop-common-user mailing list archives

From Alejandro Abdelnur <tuc...@gmail.com>
Subject Re: MultipleTextOutputFormat splitting output into different directories.
Date Wed, 16 Sep 2009 03:57:22 GMT
Using MultipleOutputs (
http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
) you can split the data into different files in the output dir.
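
For example, a minimal sketch with the 0.19 API (the named output
"keyakeyb" and the Text/Text types are placeholders; named output names
accept only letters and digits, so you cannot encode directories in
them directly):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    // In the driver:
    //   MultipleOutputs.addNamedOutput(conf, "keyakeyb",
    //       TextOutputFormat.class, Text.class, Text.class);
    //   conf.setReducerClass(RoutingReducer.class);
    public class RoutingReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        private MultipleOutputs mos;

        public void configure(JobConf job) {
            mos = new MultipleOutputs(job);
        }

        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            while (values.hasNext()) {
                // Records routed here land in the named output's own
                // files inside the single job output dir.
                mos.getCollector("keyakeyb", reporter).collect(key, values.next());
            }
        }

        public void close() throws IOException {
            mos.close(); // flushes all named outputs
        }
    }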

After your job finishes you can move the files to different directories.

The benefit of doing this is that task failures and speculative
execution will also take all these files into account, so your data
stays consistent. If you write to different directories directly, you
have to handle this by hand.
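
The move itself is simple with the FileSystem API; a rough sketch,
where the mapping from file name to target directory is just an
example:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class MoveNamedOutputs {

        // Call after JobClient.runJob(conf) has returned successfully.
        public static void rearrange(JobConf conf, Path outDir)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus st : fs.listStatus(outDir)) {
                String name = st.getPath().getName();
                // Example rule: files of the named output "keyakeyb"
                // go under KEY_A/KEY_B; adapt this to your key layout.
                if (name.startsWith("keyakeyb")) {
                    Path target = new Path(outDir, "KEY_A/KEY_B");
                    fs.mkdirs(target);
                    fs.rename(st.getPath(), new Path(target, name));
                }
            }
        }
    }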

A

On Tue, Sep 15, 2009 at 4:22 PM, Aviad sela <sela.stam@gmail.com> wrote:
> Is anybody interested in, or has anybody addressed, such a problem?
> Or does it seem to be esoteric usage?
>
>
>
>
> On Wed, Sep 9, 2009 at 7:06 PM, Aviad sela <sela.stam@gmail.com> wrote:
>
>>  I am using Hadoop 0.19.1
>>
>> I attempt to split an input into multiple directories.
>> I don't know in advance how many directories exist.
>> I don't know in advance what the directory depth is.
>> I expect that under each such directory there is a file with all available
>> records that share the same key permutation found in the job.
>>
>> Where currently each reducer produces a single output, i.e. PART-0001,
>> I would like to create as many directories as necessary, following the
>> pattern:
>>
>>                key1/key2/.../keyN/PART-0001
>>
>> where each "key?" may have a different value for each input record.
>> Different records may result in different requested paths:
>>               key1a/key2b/PART-0001
>>               key1c/key2d/key3e/PART-0001
>> To keep it simple, during each job we may expect the same depth from every
>> record.
>>
>> I assume that the input records imply that each reducer will produce
>> several hundred such directories.
>> (Indeed, this strongly depends on the input record semantics.)
>>
>>
>> The map part reads a record and, following some logic, assigns a key
>> like: KEY_A, KEY_B
>> The map value is the original input line.
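>>
>> For illustration, a minimal sketch of such a mapper (the key
>> derivation logic is application specific; here the first two
>> tab-separated fields stand in for it):
>>
>>     import java.io.IOException;
>>     import org.apache.hadoop.io.LongWritable;
>>     import org.apache.hadoop.io.Text;
>>     import org.apache.hadoop.mapred.MapReduceBase;
>>     import org.apache.hadoop.mapred.Mapper;
>>     import org.apache.hadoop.mapred.OutputCollector;
>>     import org.apache.hadoop.mapred.Reporter;
>>
>>     public class PathKeyMapper extends MapReduceBase
>>         implements Mapper<LongWritable, Text, Text, Text> {
>>
>>         public void map(LongWritable offset, Text line,
>>                 OutputCollector<Text, Text> out, Reporter reporter)
>>                 throws IOException {
>>             // Stand-in rule: the first two fields become the key
>>             // parts, giving keys like "KEY_A, KEY_B".
>>             String[] fields = line.toString().split("\t");
>>             if (fields.length >= 2) {
>>                 out.collect(new Text(fields[0] + ", " + fields[1]), line);
>>             }
>>         }
>>     }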
>>
>>
>> For the reducer part I use the IdentityReducer, and I have set:
>>
>>     jobConf.setReducerClass(IdentityReducer.class);
>>     jobConf.setOutputFormat(MyTextOutputFormat.class);
>>
>>
>>
>> where MyTextOutputFormat extends MultipleTextOutputFormat and implements:
>>
>>     protected String generateFileNameForKeyValue(K key, V value,
>>                                                  String name)
>>     {
>>         String[] keyParts = key.toString().split(",");
>>         Path finalPath = null;
>>
>>         // Build the directory structure from the key parts.
>>         for (int i = 0; i < keyParts.length; i++)
>>         {
>>             String part = keyParts[i].trim();
>>             if (!"".equals(part))
>>             {
>>                 if (finalPath == null)
>>                     finalPath = new Path(part);
>>                 else
>>                     finalPath = new Path(finalPath, part);
>>             }
>>         } // end of for
>>
>>         // Append the regular leaf file name (e.g. part-00000).
>>         String fileName = generateLeafFileName(name);
>>         finalPath = new Path(finalPath, fileName);
>>
>>         return finalPath.toString();
>>     } // generateFileNameForKeyValue
>> During execution I have seen that the reduce attempts do create the
>> following path under the output path:
>>
>> "/user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_000000_0/KEY_A/KEY_B/part-00000"
>>
>> However, the file was empty.
>>
>>
>>
>> The job fails at the end with the following exceptions found in the
>> task log:
>>
>> 2009-09-09 11:19:49,653 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
>> createBlockOutputStream java.io.IOException: Bad connect ack with
>> firstBadLink 9.148.30.71:50010
>> 2009-09-09 11:19:49,654 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>> block blk_-6138647338595590910_39383
>> 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
>> Exception: java.io.IOException: Unable to create new block.
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>>
>> 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_-6138647338595590910_39383 bad datanode[1] nodes ==
>> null
>> 2009-09-09 11:19:55,660 WARN org.apache.hadoop.hdfs.DFSClient: Could not
>> get block locations. Source file
>> "/user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_000002_0/KEY_A/KEY_B/part-00002"
>> - Aborting...
>> 2009-09-09 11:19:55,686 WARN org.apache.hadoop.mapred.TaskTracker: Error
>> running child
>> java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>> 2009-09-09 11:19:55,688 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
>> cleanup for the task
>>
>>
>> The command line also shows:
>>
>> 09/09/09 11:24:06 INFO mapred.JobClient: Task Id :
>> attempt_200909080349_0013_r_000003_2, Status : FAILED
>> java.io.IOException: Bad connect ack with firstBadLink 9.148.30.80:50010
>>         at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
>>         at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
>>         at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>>         at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>>
>>
>>
>> Any ideas how to support such a scenario?
>>
>
