hadoop-common-user mailing list archives

From GaneshG <ganeshmuthukuma...@cognizant.com>
Subject Re: Text search on a PDF file using hadoop
Date Wed, 23 Jul 2008 09:07:46 GMT

Thanks Lohit. I am using only the default reader, and I am very new to Hadoop.
This is my map method:

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          String val = tokenizer.nextToken();
          try {
            // nextToken() never returns null, so only the substring test is needed.
            if (val.contains("the")) {
              FileSplit spl = (FileSplit) reporter.getInputSplit();
              // `word` is never assigned in this method, so the token itself is the key.
              output.collect(new Text(val), new Text(spl.getPath().getName()));
            }
          } catch (Exception e) {
            // handler omitted in the original message
          }
        }
      }

I have a PDF file in my DFS input folder. Can you tell me what I have to do
to read PDF files?


lohit-2 wrote:
> Can you provide more information? How are you passing your input? Are you
> passing raw PDF files? If so, are you using your own record reader?
> The default record reader won't read PDF files, and you won't get the text
> out of them as is.
> Thanks,
> Lohit
> ----- Original Message ----
> From: GaneshG <ganeshmuthukumar.g@cognizant.com>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, July 23, 2008 1:51:52 AM
> Subject: Text search on a PDF file using hadoop
> When I search for text in a PDF file using Hadoop, the results do not
> come out properly. I tried to debug my program, and I could see that the
> lines read from the PDF file are not formatted. Please help me resolve this.
> -- 
> View this message in context:
> http://www.nabble.com/Text-search-on-a-PDF-file-using-hadoop-tp18606475p18606475.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
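
To illustrate Lohit's point that the default record reader will not give you the text as is: a PDF usually stores its page content in Flate (zlib) compressed streams, so a line-oriented reader hands map() compressed bytes rather than words. The following is only a rough sketch using the Java standard library, not a PDF parser. The class PdfStreamDemo and its methods are hypothetical names made up for this example, and the sample sentence stands in for a page's text. In a real job you would most likely need a custom record reader that runs a PDF text-extraction library over the file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.StringTokenizer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: shows why searching raw compressed bytes differs
// from searching the extracted text.
class PdfStreamDemo {

    // Compress text with zlib, similar to how a Flate-encoded stream is stored.
    static byte[] deflate(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib data back into text (the "extraction" step of this sketch).
    static String inflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input; stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Same token test as the map method above: count tokens containing the term.
    static int countMatches(String text, String term) {
        int count = 0;
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            if (tokenizer.nextToken().contains(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String page = "the quick brown fox jumps over the lazy dog";
        byte[] stored = deflate(page);
        // Roughly what a line-oriented reader passes to map(): raw bytes viewed as text.
        String raw = new String(stored, StandardCharsets.ISO_8859_1);
        System.out.println("matches in raw bytes:     " + countMatches(raw, "the"));
        System.out.println("matches after extraction: " + countMatches(inflate(stored), "the"));
    }
}
```

The search only behaves as expected on the second line of output, after the bytes have been turned back into text. Real PDFs add more layers on top of compression (fonts, encodings, text positioning operators), which is why a dedicated extraction library is normally used instead of hand-written decoding.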

View this message in context: http://www.nabble.com/Re%3A-Text-search-on-a-PDF-file-using-hadoop-tp18606558p18606703.html
