pdfbox-users mailing list archives

From Mehmet Ali Abdulhayoglu <MehmetAli.Abdulhayo...@kuleuven.be>
Subject RE: Problem when extracting text from a pdf file
Date Tue, 20 May 2014 09:56:20 GMT
Dear Maruan,

Thanks for your reply. Below you can find the links to the PDF files. Consistent with your
explanation, I can manually copy and paste text from the first PDF (dnm1), while this is not
possible for the second one (dnm2), which suggests that the latter contains no real text.

Are there any other ways to extract text from PDFs like dnm2?

dnm1.pdf:
http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf

dnm2.pdf:
http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf

Regards,
Mehmet




-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: Friday 16 May 2014 10:20 AM
To: users@pdfbox.apache.org
Subject: Re: Problem when extracting text from a pdf file

Hi Mehmet,

It could well be that text extraction works for one PDF and not for another: the second file
might not contain real text, and what you see on screen is only drawn. As the attachments
didn't make it through because of restrictions on the mailing list, could you upload them to
a public location so that we can take a look at the files and give an answer that is more
specific to your case?
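
A quick way to tell the two cases apart in code is to check how much of the extracted string
is actually readable. The following is only a minimal sketch under stated assumptions: the
class name ExtractedTextCheck, the method names and the 0.5 threshold are made up for
illustration and are not part of PDFBox. The input string is assumed to be whatever
PDFTextStripper.getText returned for the document.

```java
class ExtractedTextCheck {

    // Fraction of characters in the text that are letters or digits.
    // NUL characters, whitespace and other control characters do not count.
    static double readableRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int readable = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                readable++;
            }
        }
        return (double) readable / text.length();
    }

    // Assumption: 0.5 is an arbitrary cut-off chosen for this sketch;
    // tune it for your own documents.
    static boolean looksLikeRealText(String text) {
        return readableRatio(text) >= 0.5;
    }

    public static void main(String[] args) {
        // Stand-ins for extractor output; a real caller would pass the
        // string returned by the text stripper instead.
        String readable = "Olfactory learning";
        String nulls = "\u0000\u0000\u0000\n\u0000\u0000";
        System.out.println(looksLikeRealText(readable)); // prints "true"
        System.out.println(looksLikeRealText(nulls));    // prints "false"
    }
}
```

If the check fails for a file, the extraction code itself is probably not at fault and the
PDF most likely has no usable text layer.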

BR

Maruan Sahyoun

Am 14.05.2014 um 16:31 schrieb Mehmet Ali Abdulhayoglu <MehmetAli.Abdulhayoglu@kuleuven.be>:

> Dear all,
>  
> As part of my research, I am trying to convert PDF files to text files. I have tried
> both iText and PDFBox, but I encounter the same issue with each.
>  
> When I extract text from dnm1.pdf (attached), both approaches work well. However,
> when I apply them to dnm2.pdf, they fail.
>  
> I get a text file full of NULL values. Is this normal for PDFs with such a different
> layout, or am I missing something else?
>  
> Thanks in advance.
>  
> Regards,
> Mehmet
>  
>  
> My code:
>  
> package retrievingfulltetxsfromweb;
>  
> import connectingurl.PlacesApi;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
>  
> public class PdfBox {
>    
>     // Extract text from PDF Document
>             public PdfBox(String fileName) {
>                     //PDFParser parser = new PDFParser();
>                     String parsedText = null;
>                     PDFTextStripper pdfStripper = null;
>                     PDDocument pdDoc = null;
>                     COSDocument cosDoc = null;
>                     File file = new File(fileName);
>                     if (!file.isFile()) {
>                             System.err.println("File " + fileName + " does not exist.");
>                             // Stop here: there is nothing to parse.
>                             return;
>                     }
>                     try {
>                             PDFParser parser = new PDFParser(new FileInputStream(file));
>                             parser.parse();
>                             cosDoc = parser.getDocument();
>                             pdfStripper = new PDFTextStripper();
>                             pdDoc = new PDDocument(cosDoc);
>                             pdfStripper.setStartPage(1);
>                             pdfStripper.setEndPage(5);
>                             parsedText = pdfStripper.getText(pdDoc);
>                         System.out.println(parsedText);
>                     } catch (Exception e) {
>                             System.err.println("An exception occurred in parsing the PDF Document. "
>                                             + e.getMessage());
>                     } finally {
>                             try {
>                                     if (cosDoc != null)
>                                             cosDoc.close();
>                                     if (pdDoc != null)
>                                             pdDoc.close();
>                             } catch (Exception e) {
>                                     e.printStackTrace();
>                             }
>                     }
>                     //return parsedText;
>             }
>             public static void main(String args[]){
>                    
>                 PdfBox pdf = new PdfBox("C:/dnm1.pdf");
>                    // System.out.println(pdftoText("C:/dnm1.pdf"));
>             }
>  
> }

