hadoop-common-user mailing list archives

From Stanley Shi <s...@gopivotal.com>
Subject Re: Need FileName with Content
Date Fri, 21 Mar 2014 02:13:15 GMT
Change your mapper to something like this:

public static class TokenizerMapper extends
    Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // The split this record came from tells us which file it belongs to.
    Path pp = ((FileSplit) context.getInputSplit()).getPath();
    StringTokenizer itr = new StringTokenizer(value.toString());
    log.info("map on string: " + value.toString());
    while (itr.hasMoreTokens()) {
      // Prefix each token with its file name so counts stay per-file.
      word.set(pp.getName() + " " + itr.nextToken());
      context.write(word, one);
    }
  }
}

Note: add your filtering code here; then, when running the command, pass your input path as the parameter.
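What the composite key buys you can be checked without a cluster: after the shuffle, all `"<filename> <word>"` keys group per file, and the reducer just sums the 1s. A minimal plain-Java sketch of that flow (the file names and contents here are hypothetical, not from the thread):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Simulates the mapper's composite key "<filename> <word>" followed by a
// summing reducer. No Hadoop required; this only illustrates the grouping.
public class CompositeKeyDemo {
    static Map<String, Integer> countPerFile(Map<String, String> files) {
        Map<String, Integer> counts = new TreeMap<>();  // sorted, like shuffle output
        for (Map.Entry<String, String> e : files.entrySet()) {
            StringTokenizer itr = new StringTokenizer(e.getValue());
            while (itr.hasMoreTokens()) {
                // Same key the mapper emits: filename + " " + token
                String key = e.getKey() + " " + itr.nextToken();
                counts.merge(key, 1, Integer::sum);     // reducer: sum the 1s
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("vinitha.txt", "java java oracle java");
        files.put("sony.txt", "oracle oracle");
        for (Map.Entry<String, Integer> e : countPerFile(files).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Because the file name is part of the key, a word occurring in one file can never merge with the same word in another file.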

Regards,
*Stanley Shi,*



On Fri, Mar 21, 2014 at 9:32 AM, Stanley Shi <sshi@gopivotal.com> wrote:

> Just reviewed the code again: you are not really using map-reduce. You are
> reading all the files in one map process, which is not how a normal
> map-reduce job works.
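> To illustrate the point: the framework calls map() once per input record
> and already knows which split (file) each record came from, so a mapper
> should only process the single value it is handed. A minimal plain-Java
> sketch of that contract (the file names and the driver loop are
> hypothetical stand-ins for what Hadoop does internally):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the map() contract: the framework, not the mapper, iterates
// over the input; each map() call sees exactly one (file, line) record.
public class MapContractDemo {
    interface Mapper { void map(String file, String line, List<String> out); }

    // Stand-in for the framework's record reader / driver loop.
    static List<String> runJob(String[][] records, Mapper m) {
        List<String> out = new ArrayList<>();
        for (String[] r : records) {
            m.map(r[0], r[1], out);  // one call per record
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"vinitha.txt", "java oracle"},
            {"sony.txt", "oracle"},
        };
        // The mapper only tokenizes its own line: no directory listing,
        // no re-reading of other files.
        List<String> emitted = runJob(records, (file, line, out) -> {
            for (String tok : line.split("\\s+")) out.add(file + " " + tok + "\t1");
        });
        emitted.forEach(System.out::println);
    }
}
```

> Contrast this with the code further down the thread, where a single
> map() call lists and reads the whole input directory itself.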
>
>
> Regards,
> *Stanley Shi,*
>
>
>
> On Thu, Mar 20, 2014 at 1:50 PM, Ranjini Rathinam <ranjinibecse@gmail.com>wrote:
>
>> Hi,
>>
>> If we give the below code,
>> =======================
>> word.set("filename"+"    "+tokenizer.nextToken());
>> output.collect(word,one);
>> ======================
>>
>> The output is wrong, because it shows:
>>
>> filename   word   occurrence
>> vinitha       java       4
>> vinitha       oracle      3
>> sony           java       4
>> sony          oracle      3
>>
>>
>> Here vinitha does not have the word oracle, and similarly sony does not
>> have the word java. The file names are being merged across all words.
>>
>> I need the output as given below:
>>
>>  filename   word   occurrence
>>
>> vinitha       java       4
>> vinitha       C++    3
>> sony           ETL     4
>> sony          oracle      3
>>
>>
>>  I need the fileName along with only the words in that particular file. No
>> merging should happen.
>>
>> Please help me out for this issue.
>>
>> Please help.
>>
>> Thanks in advance.
>>
>> Ranjini
>>
>>
>>
>>
>> On Thu, Mar 20, 2014 at 10:56 AM, Ranjini Rathinam <
>> ranjinibecse@gmail.com> wrote:
>>
>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Stanley Shi <sshi@gopivotal.com>
>>> Date: Thu, Mar 20, 2014 at 7:39 AM
>>> Subject: Re: Need FileName with Content
>>> To: user@hadoop.apache.org
>>>
>>>
>>> You want to do a word count for each file, but the code gives you a word
>>> count across all the files, right?
>>>
>>> =====
>>>  word.set(tokenizer.nextToken());
>>>           output.collect(word, one);
>>> ======
>>> change it to:
>>> word.set("filename"+"    "+tokenizer.nextToken());
>>> output.collect(word,one);
>>>
>>>
>>>
>>>
>>>  Regards,
>>> *Stanley Shi,*
>>>
>>>
>>>
>>> On Wed, Mar 19, 2014 at 8:50 PM, Ranjini Rathinam <
>>> ranjinibecse@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have folder named INPUT.
>>>>
>>>> Inside INPUT there are 5 resumes.
>>>>
>>>> hduser@localhost:~/Ranjini$ hadoop fs -ls /user/hduser/INPUT
>>>> Found 5 items
>>>> -rw-r--r--   1 hduser supergroup       5438 2014-03-18 15:20
>>>> /user/hduser/INPUT/Rakesh Chowdary_Microstrategy.txt
>>>> -rw-r--r--   1 hduser supergroup       6022 2014-03-18 15:22
>>>> /user/hduser/INPUT/Ramarao Devineni_Microstrategy.txt
>>>> -rw-r--r--   1 hduser supergroup       3517 2014-03-18 15:21
>>>> /user/hduser/INPUT/vinitha.txt
>>>> -rw-r--r--   1 hduser supergroup       3517 2014-03-18 15:21
>>>> /user/hduser/INPUT/sony.txt
>>>> -rw-r--r--   1 hduser supergroup       3517 2014-03-18 15:21
>>>> /user/hduser/INPUT/ravi.txt
>>>> hduser@localhost:~/Ranjini$
>>>>
>>>> I have to process the folder and its contents.
>>>>
>>>> I need the output as:
>>>>
>>>> filename   word   occurrence
>>>> vinitha       java       4
>>>> sony          oracle      3
>>>>
>>>>
>>>>
>>>> But I am not getting the filename. As the input file contents are
>>>> merged, the file name does not come out correctly.
>>>>
>>>>
>>>> Please help fix this issue. I have given my code below.
>>>>
>>>>
>>>> import java.io.BufferedReader;
>>>> import java.io.IOException;
>>>> import java.io.InputStreamReader;
>>>> import java.util.*;
>>>> import org.apache.hadoop.conf.*;
>>>> import org.apache.hadoop.fs.FSDataInputStream;
>>>> import org.apache.hadoop.fs.FileStatus;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.*;
>>>> import org.apache.hadoop.mapred.*;
>>>> import org.apache.hadoop.mapred.lib.*;
>>>> import org.apache.hadoop.util.*;
>>>>
>>>> public class WordCount {
>>>>   public static class Map extends MapReduceBase implements
>>>>       Mapper<LongWritable, Text, Text, IntWritable> {
>>>>     private final static IntWritable one = new IntWritable(1);
>>>>     private Text word = new Text();
>>>>
>>>>     public void map(LongWritable key, Text value,
>>>>         OutputCollector<Text, IntWritable> output, Reporter reporter)
>>>>         throws IOException {
>>>>       FSDataInputStream fs = null;
>>>>       FileSystem hdfs = null;
>>>>       String line = value.toString();
>>>>       int i = 0, k = 0;
>>>>       try {
>>>>         Configuration configuration = new Configuration();
>>>>         configuration.set("fs.default.name", "hdfs://localhost:4440/");
>>>>
>>>>         Path srcPath = new Path("/user/hduser/INPUT/");
>>>>
>>>>         hdfs = FileSystem.get(configuration);
>>>>         FileStatus[] status = hdfs.listStatus(srcPath);
>>>>         fs = hdfs.open(srcPath);
>>>>         BufferedReader br = new BufferedReader(
>>>>             new InputStreamReader(hdfs.open(srcPath)));
>>>>
>>>>         String[] splited = line.split("\\s+");
>>>>         for (i = 0; i < splited.length; i++) {
>>>>           String sp[] = splited[i].split(",");
>>>>           for (k = 0; k < sp.length; k++) {
>>>>             if (!sp[k].isEmpty()) {
>>>>               StringTokenizer tokenizer = new StringTokenizer(sp[k]);
>>>>               if (sp[k].equalsIgnoreCase("C")) {
>>>>                 while (tokenizer.hasMoreTokens()) {
>>>>                   word.set(tokenizer.nextToken());
>>>>                   output.collect(word, one);
>>>>                 }
>>>>               }
>>>>               if (sp[k].equalsIgnoreCase("JAVA")) {
>>>>                 while (tokenizer.hasMoreTokens()) {
>>>>                   word.set(tokenizer.nextToken());
>>>>                   output.collect(word, one);
>>>>                 }
>>>>               }
>>>>             }
>>>>           }
>>>>         }
>>>>       } catch (IOException e) {
>>>>         e.printStackTrace();
>>>>       }
>>>>     }
>>>>   }
>>>>     public static class Reduce extends MapReduceBase implements
>>>> Reducer<Text, IntWritable, Text, IntWritable> {
>>>>       public void reduce(Text key, Iterator<IntWritable> values,
>>>> OutputCollector<Text, IntWritable> output, Reporter reporter) throws
>>>> IOException {
>>>>         int sum = 0;
>>>>         while (values.hasNext()) {
>>>>           sum += values.next().get();
>>>>         }
>>>>         output.collect(key, new IntWritable(sum));
>>>>       }
>>>>     }
>>>>     public static void main(String[] args) throws Exception {
>>>>
>>>>
>>>>       JobConf conf = new JobConf(WordCount.class);
>>>>       conf.setJobName("wordcount");
>>>>       conf.setOutputKeyClass(Text.class);
>>>>       conf.setOutputValueClass(IntWritable.class);
>>>>       conf.setMapperClass(Map.class);
>>>>       conf.setCombinerClass(Reduce.class);
>>>>       conf.setReducerClass(Reduce.class);
>>>>       conf.setInputFormat(TextInputFormat.class);
>>>>       conf.setOutputFormat(TextOutputFormat.class);
>>>>       FileInputFormat.setInputPaths(conf, new Path(args[0]));
>>>>       FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>>>       JobClient.runJob(conf);
>>>>     }
>>>>  }
>>>>
>>>>
>>>>
>>>> Please help
>>>>
>>>> Thanks in advance.
>>>>
>>>> Ranjini
>>>>
>>>>
>>>>
>>>
>>>
>>
>
