hadoop-common-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Text search on a PDF file using hadoop
Date Wed, 30 Jul 2008 20:47:37 GMT
Well, PDF is a complicated beast.  A tool like PDFBox is designed to
help, but the price is sometimes performance (but hey, it beats not
being able to do it).  There are commercial converters available, but
I can't say that they necessarily perform much better than PDFBox.
I've broken a few of them with PDFs that PDFBox handles correctly.

I know of at least two projects that provide frameworks for dealing
with PDFs (and other files like Word, etc.): Tika (a Lucene
subproject) and Aperture (http://aperture.sourceforge.net).
Additionally, with PDFBox there is no need to save the file back out if
you just want the text.  There are text extractors available that let
you read in the file and then have the text in memory.  This may
help with your performance problem, as it is likely that a good deal of
time is spent on the I/O.  See
http://pdfbox.org/userguide/text_extraction.html for how to do this.
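
A rough sketch of that in-memory approach, under stated assumptions:
the class and method names below (InMemoryText, looksLikePdf, toText)
are made up for illustration and are not part of Hadoop or PDFBox.
The PDF parsing itself is passed in as a function, since the exact
PDFBox call depends on the version in use.  The file type is checked
by looking for the "%PDF-" marker that PDF files normally begin with,
rather than by file name, and no intermediate .txt file is written, so
an existing index.txt is never touched when index.pdf is processed.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical helper for illustration; not a Hadoop or PDFBox class. */
final class InMemoryText {

    // PDF files normally begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private InMemoryText() {
    }

    /** Returns true if the content starts with the PDF header marker. */
    static boolean looksLikePdf(byte[] content) {
        if (content.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (content[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the text of a file that is already held in memory.
     * PDF content is handed to pdfExtractor (assumed to wrap a library
     * such as PDFBox); anything else is decoded as UTF-8 text.
     * Nothing is written back to disk.
     */
    static String toText(byte[] content,
                         Function<byte[], String> pdfExtractor) {
        if (looksLikePdf(content)) {
            return pdfExtractor.apply(content);
        }
        return new String(content, StandardCharsets.UTF_8);
    }
}
```

A caller would read the whole file into a byte array and pass it to
toText along with a function that wraps the PDF library's text
extractor.  The returned string can then be tokenized as in the map
method quoted below.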

You might also search the Lucene Java mail archives
(http://lucene.markmail.org) for PDF extraction.  This is something
many Lucene users have tackled over time, so you may find more insight
there.

Out of curiosity, what do you mean by "Hadoop Search"?

Cheers,
Grant

On Jul 30, 2008, at 7:25 AM, GaneshG wrote:

>
> Thanks Joman, I tried PDFBox; it converts PDFs to text files, and
> Hadoop search works fine on those files. Performance is not good,
> though, since we first have to check whether each file is a PDF and
> then convert it. It also generates .txt files with the same name as
> the original PDF, so if we already have index.txt and we convert
> index.pdf, that will be a problem for searches. We had better find
> some other way...
>
>
>
> Joman Chu-2 wrote:
>>
>> I've been investigating this recently, and I came across Apache PDFBox
>> (http://incubator.apache.org/projects/pdfbox.html), which may
>> accomplish this in native Java. Try it out and get back to us on how
>> well it works; I'd be curious to know.
>>
>> Joman Chu
>> AIM: ARcanUSNUMquam
>> IRC: irc.liquid-silver.net
>>
>>
>> On Wed, Jul 23, 2008 at 9:39 AM, Dhruba Borthakur <dhruba@gmail.com>
>> wrote:
>>> One option for you is to use a pdf-to-text converter (many of them are
>>> available online) and then run map-reduce on the txt file.
>>>
>>> -dhruba
>>>
>>> On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
>>> <ganeshmuthukumar.g@cognizant.com> wrote:
>>>>
>>>> Thanks Lohit, I am using only the default reader and I am very new to
>>>> Hadoop. This is my map method:
>>>>
>>>>     public void map(LongWritable key, Text value,
>>>>         OutputCollector<Text, Text> output, Reporter reporter)
>>>>         throws IOException {
>>>>       String line = value.toString();
>>>>       StringTokenizer tokenizer = new StringTokenizer(line);
>>>>       while (tokenizer.hasMoreTokens()) {
>>>>         String val = tokenizer.nextToken();
>>>>         try {
>>>>           if (val != null && val.contains("the")) {
>>>>             word.set(line);
>>>>             FileSplit spl = (FileSplit) reporter.getInputSplit();
>>>>             output.collect(word, new Text(spl.getPath().getName()));
>>>>           }
>>>>         } catch (Exception e) {
>>>>           System.out.println(e);
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>> I have a PDF file in my DFS input folder. Can you tell me what I have
>>>> to do to read PDF files?
>>>>
>>>> Thanks
>>>> Ganesh.G
>>>>
>>>>
>>>> lohit-2 wrote:
>>>>>
>>>>> Can you provide more information? How are you passing your input? Are
>>>>> you passing raw PDF files? If so, are you using your own record
>>>>> reader? The default record reader won't read PDF files, and you won't
>>>>> get the text out of them as is.
>>>>> Thanks,
>>>>> Lohit
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message ----
>>>>> From: GaneshG <ganeshmuthukumar.g@cognizant.com>
>>>>> To: core-user@hadoop.apache.org
>>>>> Sent: Wednesday, July 23, 2008 1:51:52 AM
>>>>> Subject: Text search on a PDF file using hadoop
>>>>>
>>>>>
>>>>> When I search for text in a PDF file using Hadoop, the results do not
>>>>> come out properly. I tried to debug my program, and I could see that
>>>>> the lines read from the PDF file are not formatted. Please help me
>>>>> resolve this.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
> -- 
> View this message in context: http://www.nabble.com/Re%3A-Text-search-on-a-PDF-file-using-hadoop-tp18606558p18731134.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>


