Mailing-List: contact users-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@pdfbox.apache.org
Received-SPF: pass (nike.apache.org: local policy)
From: Maruan Sahyoun <sahyoun@fileaffairs.de>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_18005EDE-61A7-46A1-B172-6D0A65D97650"
Message-Id: <D8355A75-E18F-4E91-9797-5558096C9461@fileaffairs.de>
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2070.6\))
Subject: Re: problem with parsing pdf with PDFBox
Date: Thu, 26 Mar 2015 10:24:21 +0100
References: <4723071427357478@web16o.yandex.ru>
To: users@pdfbox.apache.org
In-Reply-To: <4723071427357478@web16o.yandex.ru>

--Apple-Mail=_18005EDE-61A7-46A1-B172-6D0A65D97650
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

Hello Anna,

> On 26.03.2015 at 09:11, Golovko Anna <ann-golovko@yandex.ru> wrote:
>
> Hello!
>
> My name is Anna Yakubenko. I'm a Java developer and currently support an
> application that uses PDFBox to convert PDF to text and then stores the
> data in an XML file as output. Until now every PDF file was parsed
> properly, but I have now received a PDF file that is parsed in a way I
> did not expect. It seems that the customer added a new layer with a
> picture, header and footer to the PDF. PDFBox now extracts information
> only from the header and footer of every page and misses the important
> information in the middle of the page.
>
> I use the following source code to call the PDFBox API:
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintStream;
> import java.io.PrintWriter;
> import org.pdfbox.cos.COSDocument;
> import org.pdfbox.pdfparser.PDFParser;
> import org.pdfbox.pdmodel.PDDocument;
> import org.pdfbox.pdmodel.PDDocumentInformation;
> import org.pdfbox.util.PDFTextStripper;
>
> public class PDFTextParser
> {
>  PDFParser parser;
>  String parsedText;
>  PDFTextStripper pdfStripper;
>  PDDocument pdDoc;
>  COSDocument cosDoc;
>  PDDocumentInformation pdDocInfo;
>
>  String pdftoText(String fileName)
>  {
>    System.out.println("Parsing text from PDF file " + fileName + "....");
>    File f = new File(fileName);
>    if (!f.isFile())
>    {
>      System.out.println("File " + fileName + " does not exist.");
>      return null;
>    }
>    try
>    {
>      System.out.println("Now defining the parser: new PDFParser ");
>      this.parser = new PDFParser(new FileInputStream(f));
>    }
>    catch (Exception e)
>    {
>      System.out.println("Unable to open PDF Parser.");
>      return null;
>    }
>    try
>    {
>      System.out.println("Now working with the parser:  ");
>      this.parser.parse();
>      this.cosDoc = this.parser.getDocument();
>      this.pdfStripper = new PDFTextStripper();
>      this.pdDoc = new PDDocument(this.cosDoc);
>      this.parsedText = this.pdfStripper.getText(this.pdDoc);
>    }
>    catch (Exception e)
>    {
>      System.out.println("An exception occurred in parsing the PDF Document.");
>      e.printStackTrace();
>      try
>      {
>        if (this.cosDoc != null) {
>          this.cosDoc.close();
>        }
>        if (this.pdDoc != null) {
>          this.pdDoc.close();
>        }
>      }
>      catch (Exception e1)
>      {
>        // Report the failure from close(), not the outer exception again.
>        e1.printStackTrace();
>      }
>      return null;
>    }
>    System.out.println("Done.");
>    return this.parsedText;
>  }
>
>  void writeTexttoFile(String pdfText, String fileName)
>  {
>    System.out.println("\nWriting PDF text to output text file " + fileName + "....");
>    try
>    {
>      PrintWriter pw = new PrintWriter(fileName);
>      pw.print(pdfText);
>      pw.close();
>    }
>    catch (Exception e)
>    {
>      System.out.println("An exception occurred in writing the pdf text to file.");
>      e.printStackTrace();
>    }
>    System.out.println("Done.");
>  }
>
>  public static void main(String[] args)
>  {
>    if (args.length != 2)
>    {
>      System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
>      System.exit(1);
>    }
>    System.out.println(" MAIN: start, both files have been passed ");
>    System.out.println(" MAIN:  PDF file (arg 0) : " + args[0]);
>    System.out.println(" MAIN:  Text file (arg 1) : " + args[1]);
>    PDFTextParser pdfTextParserObj = new PDFTextParser();
>    String pdfToText = pdfTextParserObj.pdftoText(args[0]);
>    if (pdfToText == null)
>    {
>      System.out.println("PDF to Text Conversion failed.");
>    }
>    else
>    {
>      System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
>      pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
>    }
>  }
> }
>

You could simplify your code a lot with something like the following (I
haven't tested it, so there might be typos). The typical way to parse a
PDF document is to call PDDocument.load, which does the rest in the
background and returns the PDDocument you need for the PDFTextStripper:

    // Fragment of a class; needs java.io.* plus PDDocument, PDFTextStripper
    // and an IOUtils.closeQuietly helper on the classpath.
    void pdftoText(String pdfFile, String outputFile)
    {

        System.out.println("Parsing text from PDF file " + pdfFile + "....");
        File f = new File(pdfFile);
        if (!f.isFile())
        {
            System.out.println("File " + pdfFile + " does not exist.");
            return; // nothing to parse
        }

        PDDocument pdDoc = null;
        Writer output = null;
        try
        {
            // load() parses the file and returns the PDDocument directly
            pdDoc = PDDocument.load(f);
            output = new OutputStreamWriter(new FileOutputStream(outputFile));
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // stream the extracted text straight to the output file
            pdfStripper.writeText(pdDoc, output);
        }
        catch (IOException e)
        {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
        }
        finally
        {
            // always release the document and the output file
            IOUtils.closeQuietly(pdDoc);
            IOUtils.closeQuietly(output);
        }

        System.out.println("Done.");
    }
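
A side note on the writer part of the snippet above: OutputStreamWriter
without a charset uses the platform default encoding, which may not be
what you want for the text file. On Java 7 or later, a minimal sketch of
the file-writing step with an explicit UTF-8 encoding and
try-with-resources could look like this. It is plain JDK code and does
not touch PDFBox; the class name TextFileWriter and the method writeText
are made up for illustration.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
class TextFileWriter {

    // Writes the text to the target file as UTF-8. The try-with-resources
    // block closes the writer even if write() throws.
    public static void writeText(String text, Path target) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo: write a short string with a non-ASCII character and read it back.
        Path target = Files.createTempFile("pdf-text", ".txt");
        writeText("Gr\u00fc\u00dfe from page 1\n", target);
        System.out.print(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

A Writer opened this way could presumably be handed to
pdfStripper.writeText in place of the OutputStreamWriter, which would
also make the closeQuietly call for the output unnecessary.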

In addition, there is already a command-line app, ExtractText, which does
that for you.


>
> Could you please advise me how I can extract all information from the
> PDF file, or at least the data from the middle of the page? I don't
> really need the text in the header and footer.
>
> I can send my PDF and TXT files if needed.
>

With regard to the PDF, could you upload it to a public location so we
can give it a try?
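
As for the header and footer text you don't need: this does not explain
the missing text in the middle of the page, but once that text is
extracted, one possible post-processing step is to drop every line that
occurs on all pages. The sketch below is plain JDK code and only an
illustration. It assumes you already have the text of each page as a
separate string (for example by calling getText once per page with
setStartPage/setEndPage), and the class name RepeatedLineFilter and the
method dropRepeatedLines are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
class RepeatedLineFilter {

    // Returns the text of each page with the lines that occur on every page
    // removed. Lines are trimmed and blank lines are dropped along the way.
    public static List<String> dropRepeatedLines(List<String> pages) {
        if (pages.size() < 2) {
            return new ArrayList<>(pages); // nothing to compare against
        }
        // Start with the lines of the first page and keep only those that
        // are also present on all other pages.
        Set<String> common = new LinkedHashSet<>(splitLines(pages.get(0)));
        for (String page : pages.subList(1, pages.size())) {
            common.retainAll(splitLines(page));
        }
        List<String> result = new ArrayList<>();
        for (String page : pages) {
            StringBuilder kept = new StringBuilder();
            for (String line : splitLines(page)) {
                if (!common.contains(line)) {
                    kept.append(line).append('\n');
                }
            }
            result.add(kept.toString());
        }
        return result;
    }

    // Splits a page into trimmed, non-empty lines.
    private static List<String> splitLines(String page) {
        List<String> lines = new ArrayList<>();
        for (String line : page.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "ACME Report\nInvoice 1\nConfidential",
                "ACME Report\nInvoice 2\nConfidential");
        System.out.println(dropRepeatedLines(pages));
    }
}
```

Note that a footer containing a page number differs from page to page and
would not be caught by this exact-match comparison.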

BR
Maruan


> Many thanks in advance!
>
> Best regards,
> Anna Yakubenko


--Apple-Mail=_18005EDE-61A7-46A1-B172-6D0A65D97650--