hadoop-common-user mailing list archives

From "heyongqiang" <heyongqi...@software.ict.ac.cn>
Subject Re: modified word count example
Date Wed, 09 Jul 2008 01:08:12 GMT
InputFormat's method
RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
returns a RecordReader.
You can implement your own InputFormat and RecordReader:
1) the RecordReader keeps the FileSplit (a subclass of InputSplit) it was created for in a field of its class;
2) the RecordReader's createValue() method always returns the FileSplit's file field.
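
The two steps above can be sketched roughly as follows. This is only a plain-Python illustration of the pattern, not Hadoop code: FileSplit, FileNameRecordReader and word_file_pairs are hypothetical stand-ins, and putting the line text in the key slot is an assumption, since the steps only say where the file name goes.

```python
# Plain-Python stand-ins for the Hadoop types named above.
# None of these classes are the real Hadoop API; they only mimic the pattern.


class FileSplit(object):
    """Hypothetical stand-in for a split: remembers which file it came from."""

    def __init__(self, path, lines):
        self.path = path    # name of the file this split belongs to
        self.lines = lines  # the lines of text inside the split


class FileNameRecordReader(object):
    """Step 1: keep the split. Step 2: always return its file name as the value."""

    def __init__(self, split):
        self.split = split  # step 1: remember the split
        self.pos = 0

    def create_value(self):
        # Step 2: the value is always the split's file name.
        return self.split.path

    def next(self):
        """Return (line, file name), or None when the split is exhausted."""
        if self.pos >= len(self.split.lines):
            return None
        line = self.split.lines[self.pos]
        self.pos += 1
        # Assumption: the line text travels in the key slot.
        return line, self.create_value()


def word_file_pairs(split):
    """Mapper sketch: collect (word, file name) instead of (word, 1)."""
    reader = FileNameRecordReader(split)
    pairs = []
    record = reader.next()
    while record is not None:
        line, file_name = record
        for w in line.split():
            pairs.append((w, file_name))
        record = reader.next()
    return pairs


pairs = word_file_pairs(FileSplit("x.txt", ["water hadoop", "water"]))
print(pairs)
```

In a real job the mapper would call output.collect(Text(w), Text(file_name)) where the sketch appends to a list.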

Hope this helps.


From: Sandy
Sent: 2008-07-09 01:45:15
To: core-user@hadoop.apache.org
Subject: modified word count example


Let's say I want to run a map reduce job on a series of text files (say
x.txt, y.txt and z.txt).

Given the following mapper function in python (from WordCount.py):

class WordCountMap(Mapper, MapReduceBase):
    one = IntWritable(1) # removed
    def map(self, key, value, output, reporter):
        for w in value.toString().split():
            output.collect(Text(w), self.one) #how can I modify this line?

Instead of creating pairs for each word found and the numeral one as the
example is doing, is there a function I can invoke to store the name of the
file it came from instead?

Thus, I'd have pairs like <"water", "x.txt">, <"hadoop", "y.txt">,
<"hadoop", "z.txt">, etc.

I took a look at the Javadoc, but I'm not sure if I've checked in the right
places. Could someone point me in the right direction?

