pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Problem when extracting text from a pdf file
Date Wed, 21 May 2014 09:14:11 GMT
Hi Mehmet,

Sorry, now I see your issue. It's an encoding issue in the PDF itself: copying and pasting from Adobe Reader gives the same result. I don't think we can do very much about it, but I'll look into it in more detail.

BR

Maruan Sahyoun
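
One way to detect this kind of broken encoding automatically is to measure how much of the extracted text is unreadable. The following is only a minimal sketch, not part of PDFBox: the class name `ExtractionCheck`, its methods, and the 0.3 threshold are all assumptions made for illustration. It operates on a plain `String`, so it can be applied to the output of any extractor.

```java
public class ExtractionCheck {

    /**
     * Returns the fraction of code points that are NUL or other control
     * characters, private-use characters, or U+FFFD. Newline, carriage
     * return and tab are treated as ordinary whitespace.
     */
    public static double unreadableFraction(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int bad = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean ordinaryWhitespace = cp == '\n' || cp == '\r' || cp == '\t';
            boolean control = Character.isISOControl(cp) && !ordinaryWhitespace;
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            if (control || privateUse || cp == 0xFFFD) {
                bad++;
            }
        }
        return (double) bad / total;
    }

    /** Hypothetical heuristic: flag text whose unreadable share exceeds the threshold. */
    public static boolean looksGarbled(String text, double threshold) {
        return unreadableFraction(text) > threshold;
    }

    public static void main(String[] args) {
        String readable = "Olfactory Learning-Induced Increase in Spine Density";
        String garbled = "\u0000\u0000\u0001\u0000 \u0000\u0002";
        System.out.println(looksGarbled(readable, 0.3));
        System.out.println(looksGarbled(garbled, 0.3));
    }
}
```

Text flagged this way cannot be repaired after extraction, because the PDF does not carry a usable character mapping. Such files would have to be handled differently, for example by rendering the pages and running OCR.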

On 21.05.2014 at 11:06, Mehmet Ali Abdulhayoglu <MehmetAli.Abdulhayoglu@kuleuven.be> wrote:

> Dear Maruan,
> 
> I have checked them again. I am sure they are the correct ones.
> 
> The PDF from the first link has the title "Olfactory Learning-Induced Increase in Spine Density Along the...Neurons". I can process this PDF.
> 
> The second one has the title "Relationship between intercepted radiation, net photosynthesis, respiration, and rate of ....densities". I could not handle this one.
> 
> Indeed, when I copy and paste some text from this PDF, I get something like:
> 
> *
> 
>  
> 
> 
> 
> When you extracted the text from the second one, did you use the Java code that I sent in my first mail, or something else?
> 
> 
> Thanks for your attention.
> 
> Best regards,
> Mehmet 
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
> Sent: Wednesday 21 May 2014 8:51 AM
> To: users@pdfbox.apache.org
> Subject: Re: Problem when extracting text from a pdf file
> 
> Dear Mehmet,
> 
> did you supply the correct PDFs? I can manually copy and paste text from both, and I can also extract the text from both using PDFBox.
> 
> BR
> 
> Maruan Sahyoun
> 
> On 20.05.2014 at 11:56, Mehmet Ali Abdulhayoglu <MehmetAli.Abdulhayoglu@kuleuven.be> wrote:
> 
>> Dear Maruan,
>> 
>> Thanks for your reply. Below you can find the links to the PDF files. As you state, I can manually copy and paste the text from the first PDF (dnm1), while this is not possible for the second one (dnm2), which suggests that the latter contains no real text.
>> 
>> Are there any other ways to extract text from PDFs like dnm2?
>> 
>> dnm1.pdf:
>> http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf
>> 
>> dnm2.pdf:
>> http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf
>> 
>> Regards,
>> Mehmet
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>> Sent: Friday 16 May 2014 10:20 AM
>> To: users@pdfbox.apache.org
>> Subject: Re: Problem when extracting text from a pdf file
>> 
>> Hi Mehmet,
>> 
>> it could well be that text extraction works for one PDF and not for another, because the second might not contain real text: what you see on screen may simply be drawn. The attachments didn't make it through because of restrictions on the mailing list. Could you upload the files to a public location so that I can take a look and give an answer that is more specific to your case?
>> 
>> BR
>> 
>> Maruan Sahyoun
>> 
>> On 14.05.2014 at 16:31, Mehmet Ali Abdulhayoglu <MehmetAli.Abdulhayoglu@kuleuven.be> wrote:
>> 
>>> Dear all,
>>> 
>>> As part of my research, I am trying to convert PDF files to text files. I have tried both iText and PDFBox, but I encounter the same issue with each.
>>> 
>>> When I extract text from dnm1.pdf (attached), both approaches work well. However, both fail on dnm2.pdf.
>>> 
>>> I get a text file full of NULL values. Is this normal for such differently shaped PDFs, or am I missing something else?
>>> 
>>> Thanks in advance.
>>> 
>>> Regards,
>>> Mehmet
>>> 
>>> 
>>> My code:
>>> 
>>> package retrievingfulltetxsfromweb;
>>> 
>>> import java.io.File;
>>> import java.io.IOException;
>>> import org.apache.pdfbox.pdmodel.PDDocument;
>>> import org.apache.pdfbox.util.PDFTextStripper;
>>> 
>>> public class PdfBox {
>>> 
>>>     // Extract text from the first five pages of a PDF document and print it.
>>>     public PdfBox(String fileName) {
>>>         File file = new File(fileName);
>>>         if (!file.isFile()) {
>>>             System.err.println("File " + fileName + " does not exist.");
>>>             return; // nothing to parse
>>>         }
>>>         PDDocument pdDoc = null;
>>>         try {
>>>             // PDDocument.load parses the file, so no separate PDFParser is needed.
>>>             pdDoc = PDDocument.load(file);
>>>             PDFTextStripper pdfStripper = new PDFTextStripper();
>>>             pdfStripper.setStartPage(1);
>>>             pdfStripper.setEndPage(5);
>>>             String parsedText = pdfStripper.getText(pdDoc);
>>>             System.out.println(parsedText);
>>>         } catch (IOException e) {
>>>             System.err.println("An exception occurred in parsing the PDF document. "
>>>                     + e.getMessage());
>>>         } finally {
>>>             // Closing the PDDocument also releases the underlying COSDocument.
>>>             try {
>>>                 if (pdDoc != null)
>>>                     pdDoc.close();
>>>             } catch (IOException e) {
>>>                 e.printStackTrace();
>>>             }
>>>         }
>>>     }
>>> 
>>>     public static void main(String[] args) {
>>>         PdfBox pdf = new PdfBox("C:/dnm1.pdf");
>>>     }
>>> 
>>> }
>> 
> 

