hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Hadoop: Multiple map reduce or some better way
Date Fri, 04 Apr 2008 21:36:46 GMT


See Nutch.  See Nutch run.

http://en.wikipedia.org/wiki/Nutch
http://lucene.apache.org/nutch/



On 4/4/08 1:22 PM, "Aayush Garg" <aayush.garg@gmail.com> wrote:

> Hi,
> 
> I have never used a Lucene index before, so I do not see how we build one
> with Hadoop Map/Reduce. Basically, what I was looking for is how to
> implement multilevel map/reduce for the problem I mentioned.
> 
> 
> On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <ning.li.00@gmail.com> wrote:
> 
>> You can build Lucene indexes using Hadoop Map/Reduce. See the index
>> contrib package in the trunk. Or is that still not what you are
>> looking for?
>> 
>> Regards,
>> Ning
>> 
>> On 4/4/08, Aayush Garg <aayush.garg@gmail.com> wrote:
>>> No, currently my requirement is to solve this problem with Apache Hadoop.
>>> I am trying to build up this type of inverted index and then measure
>>> performance criteria with respect to other approaches.
>>> 
>>> Thanks,
>>> 
>>> 
>>> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <tdunning@veoh.com> wrote:
>>> 
>>>> 
>>>> Are you implementing this for instruction or production?
>>>> 
>>>> If production, why not use Lucene?
>>>> 
>>>> 
>>>> On 4/3/08 6:45 PM, "Aayush Garg" <aayush.garg@gmail.com> wrote:
>>>> 
>>>>> Hi Amar, Theodore, Arun,
>>>>> 
>>>>> Thanks for your replies. Actually I am new to Hadoop, so I can't figure
>>>>> out much. I have written the following code for an inverted index. This
>>>>> code maps each word in a document to its document id,
>>>>> e.g.: apple  file1 file123
>>>>> The main functions of the code are:
>>>>> 
>>>>> public class HadoopProgram extends Configured implements Tool {
>>>>>   public static class MapClass extends MapReduceBase
>>>>>       implements Mapper<LongWritable, Text, Text, Text> {
>>>>> 
>>>>>     private Text word = new Text();
>>>>>     private Text doc = new Text();
>>>>>     private long numRecords = 0;
>>>>>     private String inputFile;
>>>>> 
>>>>>     public void configure(JobConf job) {
>>>>>       System.out.println("Configure function is called");
>>>>>       inputFile = job.get("map.input.file");
>>>>>       System.out.println("In conf the input file is " + inputFile);
>>>>>     }
>>>>> 
>>>>>     public void map(LongWritable key, Text value,
>>>>>                     OutputCollector<Text, Text> output,
>>>>>                     Reporter reporter) throws IOException {
>>>>>       String line = value.toString();
>>>>>       StringTokenizer itr = new StringTokenizer(line);
>>>>>       doc.set(inputFile);
>>>>>       while (itr.hasMoreTokens()) {
>>>>>         word.set(itr.nextToken());
>>>>>         output.collect(word, doc);   // emit <word, docId>
>>>>>       }
>>>>>       if (++numRecords % 4 == 0) {
>>>>>         System.out.println("Finished processing of input file " + inputFile);
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> 
>>>>>   /**
>>>>>    * A reducer class that collects, for each word, the list of
>>>>>    * document ids it occurs in.
>>>>>    */
>>>>>   public static class Reduce extends MapReduceBase
>>>>>       implements Reducer<Text, Text, Text, DocIDs> {
>>>>> 
>>>>>     // This works as K2, V2, K3, V3
>>>>>     public void reduce(Text key, Iterator<Text> values,
>>>>>                        OutputCollector<Text, DocIDs> output,
>>>>>                        Reporter reporter) throws IOException {
>>>>>       ArrayList<String> ids = new ArrayList<String>();
>>>>>       while (values.hasNext()) {
>>>>>         // copy to a String: Hadoop reuses the Text object between calls
>>>>>         ids.add(values.next().toString());
>>>>>       }
>>>>>       DocIDs dc = new DocIDs();
>>>>>       dc.setListdocs(ids);
>>>>>       output.collect(key, dc);
>>>>>     }
>>>>>   }
>>>>> 
>>>>>   public int run(String[] args) throws Exception {
>>>>>     System.out.println("Run function is called");
>>>>>     JobConf conf = new JobConf(getConf(), HadoopProgram.class);
>>>>>     conf.setJobName("invertedindex");
>>>>> 
>>>>>     // the keys are words (strings)
>>>>>     conf.setOutputKeyClass(Text.class);
>>>>>     // map output values are Text, but the reducer emits DocIDs
>>>>>     conf.setMapOutputValueClass(Text.class);
>>>>>     conf.setOutputValueClass(DocIDs.class);
>>>>> 
>>>>>     conf.setMapperClass(MapClass.class);
>>>>>     conf.setReducerClass(Reduce.class);
>>>>> 
>>>>>     conf.setInputPath(new Path(args[0]));
>>>>>     conf.setOutputPath(new Path(args[1]));
>>>>>     JobClient.runJob(conf);
>>>>>     return 0;
>>>>>   }
>>>>> }
>>>>> 
>>>>> 
>>>>> Now I am getting output from the reducer like:
>>>>> word  \root\test\test123, \root\test12
>>>>> 
>>>>> In the next stage I want to remove 'stop words', scrub words, etc., and
>>>>> also record things like the position of each word in the document. How
>>>>> would I apply multiple maps or multilevel map/reduce jobs
>>>>> programmatically? I guess I need to make another class or add some
>>>>> functions to this one? I am not able to figure it out.
>>>>> Any pointers for these types of problems?
>>>>> 
>>>>> Thanks,
>>>>> Aayush
>>>>> 
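On the "multiple map/reduce jobs programmatically" question: with the old org.apache.hadoop.mapred API the usual pattern is simply to run one JobConf to completion and point the next job's input at its output directory. A rough sketch under stated assumptions — the paths are illustrative, ScrubMapper/ScrubReducer are hypothetical second-pass classes, and depending on your Hadoop version the path setters may instead be FileInputFormat.setInputPaths / FileOutputFormat.setOutputPath:

```java
public int run(String[] args) throws Exception {
    Path indexOut = new Path("inverted-index");   // illustrative intermediate dir
    Path scrubOut = new Path("scrubbed-index");

    // Job 1: build the raw inverted index.
    JobConf first = new JobConf(getConf(), HadoopProgram.class);
    first.setJobName("build-index");
    first.setMapperClass(MapClass.class);
    first.setReducerClass(Reduce.class);
    first.setOutputKeyClass(Text.class);
    first.setMapOutputValueClass(Text.class);
    first.setOutputValueClass(DocIDs.class);
    first.setInputPath(new Path(args[0]));
    first.setOutputPath(indexOut);
    JobClient.runJob(first);   // blocks until job 1 finishes

    // Job 2: read job 1's output; scrub stop words, add positions, etc.
    JobConf second = new JobConf(getConf(), HadoopProgram.class);
    second.setJobName("scrub-index");
    second.setMapperClass(ScrubMapper.class);     // hypothetical
    second.setReducerClass(ScrubReducer.class);   // hypothetical
    second.setInputPath(indexOut);
    second.setOutputPath(scrubOut);
    JobClient.runJob(second);
    return 0;
}
```

Because JobClient.runJob() is synchronous, chaining N jobs is just N such calls in sequence inside run().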
>>>>> 
>>>>> On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <amarrk@yahoo-inc.com>
>>>> wrote:
>>>>> 
>>>>>> On Wed, 26 Mar 2008, Aayush Garg wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> I am developing a simple inverted index program with Hadoop. My map
>>>>>>> function has the output:
>>>>>>> <word, doc>
>>>>>>> and the reducer has:
>>>>>>> <word, list(docs)>
>>>>>>> 
>>>>>>> Now I want to use one more mapreduce to remove stop and scrub words
>>>>>>> from
>>>>>> Use the distributed cache as Arun mentioned.
>>>>>>> this output. Also in the next stage I would like to have a short
>>>>>>> summary
>>>>>> Whether to use a separate MR job depends on what exactly you mean by
>>>>>> summary. If it's like a window around the current word then you can
>>>>>> possibly do it in one go.
>>>>>> Amar
>>>>>>> associated with every word. How should I design my program from this
>>>>>>> stage? I mean how would I apply multiple mapreduce to this? What would
>>>>>>> be the better way to perform this?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Regards,
>>>>>>> -
>>>>>>> 
>>>>>> 
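The stop-word scrub Amar refers to is, at its core, a set-membership filter applied to each token; the stop-word list itself would typically be shipped to the tasks via the distributed cache. A minimal stdlib-only sketch of just the filtering step (class name and word list are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Minimal stop-word scrubber: the filtering a second-pass map() would
// perform after loading the stop-word file from the DistributedCache.
public class StopWordFilter {
    private final Set<String> stopWords;

    public StopWordFilter(Collection<String> stopWords) {
        this.stopWords = new HashSet<String>();
        for (String w : stopWords) {
            this.stopWords.add(w.toLowerCase());  // case-insensitive match
        }
    }

    // Returns the tokens of 'line' with stop words removed.
    public List<String> scrub(String line) {
        List<String> kept = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (!stopWords.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        StopWordFilter f = new StopWordFilter(Arrays.asList("the", "a", "of"));
        System.out.println(f.scrub("the list of document IDs"));
        // prints [list, document, IDs]
    }
}
```

Inside a Hadoop mapper the same scrub() would run per record, with the stop-word set built once in configure().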
>>>> 
>>>> 
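Amar's "window around the current word" summary can indeed be computed in the same pass over a document, without a separate job. A small stdlib sketch of the idea (class name, window radius, and first-occurrence policy are illustrative choices, not from the thread):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a "window around the current word" summary: for each token,
// capture up to 'radius' neighbouring tokens on each side as its snippet.
public class ContextWindow {
    // Returns token -> snippet of surrounding words (first occurrence wins).
    public static Map<String, String> summarize(String line, int radius) {
        String[] tokens = line.split("\\s+");
        Map<String, String> summary = new LinkedHashMap<String, String>();
        for (int i = 0; i < tokens.length; i++) {
            if (summary.containsKey(tokens[i])) {
                continue;  // keep the snippet of the first occurrence only
            }
            int lo = Math.max(0, i - radius);
            int hi = Math.min(tokens.length, i + radius + 1);
            StringBuilder sb = new StringBuilder();
            for (int j = lo; j < hi; j++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(tokens[j]);
            }
            summary.put(tokens[i], sb.toString());
        }
        return summary;
    }
}
```

A mapper could emit <word, snippet> pairs this way alongside <word, docId>, so the summary rides along in the single inverted-index job.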
>>> 
>>> 
>>> --
>>> Aayush Garg,
>>> Phone: +41 76 482 240
>>> 
>> 
> 
> 

