hadoop-common-user mailing list archives

From: Ted Dunning <tdunn...@veoh.com>
Subject: Re: Hadoop: Multiple map reduce or some better way
Date: Fri, 04 Apr 2008 15:54:01 GMT

Are you implementing this for instruction or production?

If production, why not use Lucene?


On 4/3/08 6:45 PM, "Aayush Garg" <aayush.garg@gmail.com> wrote:

> Hi Amar, Theodore, Arun,
> 
> Thanks for your reply. Actually I am new to Hadoop, so I can't figure out much.
> I have written the following code for an inverted index. This code maps each
> word in a document to its document id,
> e.g.: apple -> file1, file123
> The main functions of the code are:
> 
> public class HadoopProgram extends Configured implements Tool {
> public static class MapClass extends MapReduceBase
>     implements Mapper<LongWritable, Text, Text, Text> {
> 
>     // Reusable Writables for the <word, documentId> output pairs.
>     private Text word = new Text();
>     private Text doc = new Text();
>     private long numRecords=0;
>     private String inputFile;
> 
>    public void configure(JobConf job){
>         System.out.println("Configure function is called");
>         inputFile = job.get("map.input.file");
>         System.out.println("In configure, the input file is " + inputFile);
>     }
> 
> 
>     public void map(LongWritable key, Text value,
>                     OutputCollector<Text, Text> output,
>                     Reporter reporter) throws IOException {
>       // Emit a <word, documentId> pair for every token in the line.
>       String line = value.toString();
>       StringTokenizer itr = new StringTokenizer(line);
>       doc.set(inputFile);
>       while (itr.hasMoreTokens()) {
>         word.set(itr.nextToken());
>         output.collect(word, doc);
>       }
>       if (++numRecords % 4 == 0) {
>         System.out.println("Processed " + numRecords + " records from " + inputFile);
>       }
>     }
>   }
> 
>   /**
>    * A reducer class that collects, for each word, the list of document
>    * IDs it appears in.
>    */
>   public static class Reduce extends MapReduceBase
>     implements Reducer<Text, Text, Text, DocIDs> {
> 
>   // Types here are K2=Text, V2=Text, K3=Text, V3=DocIDs
>     public void reduce(Text key, Iterator<Text> values,
>                        OutputCollector<Text, DocIDs> output,
>                        Reporter reporter) throws IOException {
>       // Collect every document id seen for this word. Copy each value
>       // out with toString(): Hadoop reuses the same Text instance on
>       // every call to values.next().
>       ArrayList<String> IDs = new ArrayList<String>();
>       while (values.hasNext()) {
>         IDs.add(values.next().toString());
>       }
>       DocIDs dc = new DocIDs();
>       dc.setListdocs(IDs);
>       output.collect(key, dc);
>     }
>   }
> 
>  public int run(String[] args) throws Exception {
>     System.out.println("Run function is called");
>     JobConf conf = new JobConf(getConf(), HadoopProgram.class);
>     conf.setJobName("invertedindex");
> 
>     // The map emits <Text, Text>, but the reduce emits <Text, DocIDs>,
>     // so the map output value class must be set separately.
>     conf.setOutputKeyClass(Text.class);
>     conf.setOutputValueClass(DocIDs.class);
>     conf.setMapOutputValueClass(Text.class);
> 
>     conf.setMapperClass(MapClass.class);
>     conf.setReducerClass(Reduce.class);
> 
>     FileInputFormat.setInputPaths(conf, new Path(args[0]));
>     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> 
>     JobClient.runJob(conf);
>     return 0;
>  }
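Side note: since your class implements Tool, you also need a main method that
hands off to ToolRunner, or the job never gets submitted. Something like this
(untested sketch):

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic Hadoop options (-D, -fs, -jt, ...)
    // and passes the remaining arguments on to run().
    int res = ToolRunner.run(new Configuration(), new HadoopProgram(), args);
    System.exit(res);
  }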
> 
> 
> Now I am getting the output from the reducer as:
> word \root\test\test123, \root\test12
> 
> In the next stage I want to remove stop words, scrub words, etc., and also
> record the position of each word in the document. How would I apply multiple
> maps or multilevel MapReduce jobs programmatically? I guess I need to make
> another class or add some functions to it? I am not able to figure it out.
> Any pointers for this type of problem?
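Chaining is just sequential code in your driver: run the first job to
completion, then point the second job's input at the first job's output
directory. An untested sketch against the old JobConf API (the ScrubMapper
class and the path layout are made up for illustration):

  public int run(String[] args) throws Exception {
    Path indexOut = new Path(args[1], "index");      // output of job 1
    Path finalOut = new Path(args[1], "scrubbed");   // output of job 2

    // Job 1: build the raw inverted index with the classes above.
    JobConf job1 = new JobConf(getConf(), HadoopProgram.class);
    job1.setJobName("build-index");
    job1.setMapperClass(MapClass.class);
    job1.setReducerClass(Reduce.class);
    job1.setOutputKeyClass(Text.class);
    job1.setMapOutputValueClass(Text.class);
    job1.setOutputValueClass(DocIDs.class);
    job1.setOutputFormat(SequenceFileOutputFormat.class); // easy for job 2 to re-read
    FileInputFormat.setInputPaths(job1, new Path(args[0]));
    FileOutputFormat.setOutputPath(job1, indexOut);
    JobClient.runJob(job1);   // blocks until job 1 completes

    // Job 2: re-read the <Text, DocIDs> records, drop stop/scrub words,
    // and (if you store offsets in DocIDs) keep word positions.
    JobConf job2 = new JobConf(getConf(), HadoopProgram.class);
    job2.setJobName("scrub-index");
    job2.setInputFormat(SequenceFileInputFormat.class);
    job2.setMapperClass(ScrubMapper.class); // hypothetical Mapper<Text, DocIDs, Text, DocIDs>
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(DocIDs.class);
    FileInputFormat.setInputPaths(job2, indexOut);
    FileOutputFormat.setOutputPath(job2, finalOut);
    JobClient.runJob(job2);
    return 0;
  }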
> 
> Thanks,
> Aayush
> 
> 
> On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
> 
>> On Wed, 26 Mar 2008, Aayush Garg wrote:
>> 
>>> Hi,
>>> I am developing a simple inverted index program on Hadoop. My map
>>> function has the output:
>>> <word, doc>
>>> and the reducer has:
>>> <word, list(docs)>
>>> 
>>> Now I want to use one more MapReduce job to remove stop and scrub words from
>> Use distributed cache as Arun mentioned.
>>> this output. Also in the next stage I would like to have a short summary
>> Whether to use a separate MR job depends on what exactly you mean by
>> summary. If it's like a window around the current word, then you can
>> possibly do it in one go.
>> Amar
>>> associated with every word. How should I design my program from this
>>> stage?
>>> I mean, how would I apply multiple MapReduce jobs to this? What would be
>>> the better way to perform this?
>>> 
>>> Thanks,
>>> 
>>> Regards,
>>> -
>>> 
>>> 
>> 
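On the distributed-cache suggestion above: the usual pattern is to ship a
stop-word file to every node and load it once in configure(). A rough,
untested sketch (the file path is made up; DistributedCache lives in
org.apache.hadoop.filecache):

  // In the driver, before submitting the job:
  DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), conf);

  // In the mapper:
  private Set<String> stopWords = new HashSet<String>();

  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String w;
      while ((w = in.readLine()) != null) {
        stopWords.add(w.trim());
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load stop words", e);
    }
  }

  // In map(), skip stop words before collecting:
  //   if (stopWords.contains(token)) { continue; }

And if the summary really is just a window around each word, emit the word's
position along with the doc id in the first map, as Amar says, and you avoid
the extra job entirely.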

