pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: problem with parsing pdf with PDFBox
Date Thu, 26 Mar 2015 09:24:21 GMT
Hello Anna,

> On 26.03.2015 at 09:11, Golovko Anna <ann-golovko@yandex.ru> wrote:
> 
> Hello!
> 
> My name is Anna Yakubenko. I'm a Java developer and currently support an application that
> uses PDFBox to convert PDF to text and then stores the data in an XML file as output. Until
> now every PDF file was parsed properly by PDFBox, but I have now received a PDF file that is
> parsed in a way I did not expect. It seems that the customer added a new layer with a picture,
> a running header and a footer to the PDF. PDFBox now extracts only the running header and
> footer from every page and misses the important information in the middle of the page.
> 
> I use the following source code to call the PDFBox API:
> 
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintStream;
> import java.io.PrintWriter;
> import org.pdfbox.cos.COSDocument;
> import org.pdfbox.pdfparser.PDFParser;
> import org.pdfbox.pdmodel.PDDocument;
> import org.pdfbox.pdmodel.PDDocumentInformation;
> import org.pdfbox.util.PDFTextStripper;
> 
> public class PDFTextParser
> {
>  PDFParser parser;
>  String parsedText;
>  PDFTextStripper pdfStripper;
>  PDDocument pdDoc;
>  COSDocument cosDoc;
>  PDDocumentInformation pdDocInfo;
> 
>  String pdftoText(String fileName)
>  {
>    System.out.println("Parsing text from PDF file " + fileName + "....");
>    File f = new File(fileName);
>    if (!f.isFile())
>    {
>      System.out.println("File " + fileName + " does not exist.");
>      return null;
>    }
>    try
>    {
>      System.out.println("Now creating the parser: new PDFParser ");
>      this.parser = new PDFParser(new FileInputStream(f));
>    }
>    catch (Exception e)
>    {
>      System.out.println("Unable to open PDF Parser.");
>      return null;
>    }
>    try
>    {
>      System.out.println("Now working with the parser: ");
>      this.parser.parse();
>      this.cosDoc = this.parser.getDocument();
>      this.pdfStripper = new PDFTextStripper();
>      this.pdDoc = new PDDocument(this.cosDoc);
>      this.parsedText = this.pdfStripper.getText(this.pdDoc);
>    }
>    catch (Exception e)
>    {
>      System.out.println("An exception occurred in parsing the PDF Document.");
>      e.printStackTrace();
>      try
>      {
>        if (this.cosDoc != null) {
>          this.cosDoc.close();
>        }
>        if (this.pdDoc != null) {
>          this.pdDoc.close();
>        }
>      }
>      catch (Exception e1)
>      {
>        e1.printStackTrace();
>      }
>      return null;
>    }
>    System.out.println("Done.");
>    return this.parsedText;
>  }
> 
>  void writeTexttoFile(String pdfText, String fileName)
>  {
>    System.out.println("\nWriting PDF text to output text file " + fileName + "....");
>    try
>    {
>      PrintWriter pw = new PrintWriter(fileName);
>      pw.print(pdfText);
>      pw.close();
>    }
>    catch (Exception e)
>    {
>      System.out.println("An exception occurred in writing the pdf text to file.");
>      e.printStackTrace();
>    }
>    System.out.println("Done.");
>  }
> 
>  public static void main(String[] args)
>  {
>    if (args.length != 2)
>    {
>      System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
>      System.exit(1);
>    }
>    System.out.println(" MAIN: Start, both files have been passed ");
>    System.out.println(" MAIN:  PDF file (arg 0) : " + args[0]);
>    System.out.println(" MAIN:  Text file (arg 1) : " + args[1]);
>    PDFTextParser pdfTextParserObj = new PDFTextParser();
>    String pdfToText = pdfTextParserObj.pdftoText(args[0]);
>    if (pdfToText == null)
>    {
>      System.out.println("PDF to Text Conversion failed.");
>    }
>    else
>    {
>      System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
>      pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
>    }
>  }
> }
> 

You could simplify your code a lot with something like the following (I haven't tested it, so
there might be typos). The typical way to parse a PDF document is to call PDDocument.load,
which does the rest in the background and returns the PDDocument you need for the
PDFTextStripper:

    void pdftoText(String pdfFile, String outputFile)
    {

        System.out.println("Parsing text from PDF file " + pdfFile + "....");
        File f = new File(pdfFile);
        if (!f.isFile())
        {
            System.out.println("File " + pdfFile + " does not exist.");
            return;
        }
        
        PDDocument pdDoc = null;
        Writer output = null;
        try
        {
            pdDoc = PDDocument.load(f);
            output = new OutputStreamWriter( new FileOutputStream( outputFile ));
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.writeText(pdDoc, output);
        }
        catch (IOException e)
        {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
        }
        finally
        {
            IOUtils.closeQuietly(pdDoc);
            IOUtils.closeQuietly(output);
        }

        System.out.println("Done.");
    }
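
One caveat with the snippet above: an OutputStreamWriter created without a charset uses the platform default encoding, so non-ASCII text may come out garbled on some systems. Below is a minimal sketch of opening the output with an explicit encoding, assuming UTF-8 output is wanted and Java 7 or later is available. The class Utf8TextWriter and its methods are hypothetical helpers, not part of PDFBox; it uses only the standard library.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of PDFBox.
class Utf8TextWriter
{
    // Opens the output file with an explicit UTF-8 encoding instead of
    // relying on the platform default.
    static Writer open(String outputFile) throws IOException
    {
        return new OutputStreamWriter(new FileOutputStream(outputFile), StandardCharsets.UTF_8);
    }

    // Writes the text and closes the writer, even if write() throws.
    static void writeText(String text, String outputFile) throws IOException
    {
        try (Writer output = open(outputFile))
        {
            output.write(text);
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path tmp = Files.createTempFile("pdftext", ".txt");
        // \u00fc is the letter u with umlaut, written as an escape so the
        // source file encoding does not matter.
        writeText("\u00fcbergeben", tmp.toString());
        byte[] bytes = Files.readAllBytes(tmp);
        System.out.println("Bytes written: " + bytes.length);
        Files.delete(tmp);
    }
}
```

The Writer returned by open(...) could be passed to pdfStripper.writeText(pdDoc, output) in place of the OutputStreamWriter created in the snippet above.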

In addition, there is already a command-line app, ExtractText, which does this for you.



> 
> Could you please advise me how I can extract all the information from the PDF file, or at
> least the data from the middle of the page? I don't really need the text in the running
> header and footer.
> 
> I can send my PDF and the TXT output if needed.
> 

With regard to the PDF, could you upload it to a public location so we can give it a try?

BR
Maruan


> Many thanks in advance!
> 
> Best regards,
> Anna Yakubenko
> 

