hadoop-common-user mailing list archives

From "Joman Chu" <jom...@andrew.cmu.edu>
Subject Re: Text search on a PDF file using hadoop
Date Wed, 23 Jul 2008 21:30:43 GMT
I've been investigating this recently, and I came across Apache PDFBox
(http://incubator.apache.org/projects/pdfbox.html), which may be able to
extract the text in native Java. Try it out and let us know how well it
works; I'd be curious to know.

Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
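
Once the PDF has been converted to plain text (by whichever extractor ends up working), the search itself only needs ordinary lines of text. Below is a minimal sketch of that matching step in plain Java, with no Hadoop or PDFBox dependency. The class name ExtractedTextSearch and the method matchingLines are hypothetical names used for illustration, and the extraction step is assumed to have already happened. Unlike the map method quoted below, it reports a line once even when several tokens match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper: searches lines of already-extracted text.
// The PDF-to-text step is assumed to have run beforehand and is not shown.
public class ExtractedTextSearch {

    // Returns every line with at least one whitespace-delimited token
    // that contains the search term as a substring.
    static List<String> matchingLines(List<String> lines, String term) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                if (tokenizer.nextToken().contains(term)) {
                    hits.add(line);
                    break; // report the line once, even if several tokens match
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF.
        List<String> extracted = Arrays.asList(
                "the quick brown fox",
                "jumps over",
                "another lazy dog");
        for (String hit : matchingLines(extracted, "the")) {
            System.out.println(hit);
        }
    }
}
```

Note that this is a substring match, so "another" counts as a hit for "the", just as it would with val.contains("the") in the original map method.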


On Wed, Jul 23, 2008 at 9:39 AM, Dhruba Borthakur <dhruba@gmail.com> wrote:
> One option for you is to use a pdf-to-text converter (many of them are
> available online) and then run map-reduce on the txt file.
>
> -dhruba
>
> On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
> <ganeshmuthukumar.g@cognizant.com> wrote:
>>
>> Thanks Lohit, I am using only the default reader and I am very new to Hadoop.
>> This is my map method:
>>
>>      public void map(LongWritable key, Text value,
>>          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
>>        String line = value.toString();
>>        StringTokenizer tokenizer = new StringTokenizer(line);
>>        while (tokenizer.hasMoreTokens()) {
>>          String val = tokenizer.nextToken();
>>          try {
>>            if (val != null && val.contains("the")) {
>>              word.set(line);
>>              FileSplit spl = (FileSplit) reporter.getInputSplit();
>>              output.collect(word, new Text(spl.getPath().getName()));
>>            }
>>          } catch (Exception e) {
>>            System.out.println(e);
>>          }
>>        }
>>      }
>>    }
>>
>> I have a PDF file in my DFS input folder. Can you tell me what I have to do
>> to read PDF files?
>>
>> Thanks
>> Ganesh.G
>>
>>
>> lohit-2 wrote:
>>>
>>> Can you provide more information? How are you passing your input? Are you
>>> passing raw PDF files? If so, are you using your own record reader?
>>> The default record reader won't read PDF files, and you won't get the text
>>> out of them as is.
>>> Thanks,
>>> Lohit
>>>
>>>
>>>
>>> ----- Original Message ----
>>> From: GaneshG <ganeshmuthukumar.g@cognizant.com>
>>> To: core-user@hadoop.apache.org
>>> Sent: Wednesday, July 23, 2008 1:51:52 AM
>>> Subject: Text search on a PDF file using hadoop
>>>
>>>
>>> When I search for text in a PDF file using Hadoop, the results do not come
>>> out properly. I tried to debug my program, and I could see that the lines
>>> read from the PDF file are not formatted. Please help me resolve this.
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Text-search-on-a-PDF-file-using-hadoop-tp18606475p18606475.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>>
>> --
>> View this message in context: http://www.nabble.com/Re%3A-Text-search-on-a-PDF-file-using-hadoop-tp18606558p18606703.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>
>
