From: Maruan Sahyoun
To: users@pdfbox.apache.org
Subject: Re: problem with parsing pdf with PDFBox
Date: Thu, 26 Mar 2015 10:24:21 +0100
In-Reply-To: <4723071427357478@web16o.yandex.ru>

Hello Anna,

> On 26.03.2015 at 09:11, Golovko Anna wrote:
>
> Hello!
>
> My name is Anna Yakubenko. I'm a Java developer and I currently support an
> application that uses PDFBox to convert PDF to text and then stores the data
> in an XML file as output. Previously, all PDF files were parsed by PDFBox
> properly, but now I have a PDF file that is parsed in a way I did not expect.
> It seems that the customer added a new layer with a picture, a header and a
> footer to the PDF. Now PDFBox extracts only the header and footer text from
> every page and misses the important information in the middle of the page.
>
> I use the following source code to call the PDFBox API:
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintStream;
> import java.io.PrintWriter;
> import org.pdfbox.cos.COSDocument;
> import org.pdfbox.pdfparser.PDFParser;
> import org.pdfbox.pdmodel.PDDocument;
> import org.pdfbox.pdmodel.PDDocumentInformation;
> import org.pdfbox.util.PDFTextStripper;
>
> public class PDFTextParser
> {
>   PDFParser parser;
>   String parsedText;
>   PDFTextStripper pdfStripper;
>   PDDocument pdDoc;
>   COSDocument cosDoc;
>   PDDocumentInformation pdDocInfo;
>
>   String pdftoText(String fileName)
>   {
>     System.out.println("Parsing text from PDF file " + fileName + "....");
>     File f = new File(fileName);
>     if (!f.isFile())
>     {
>       System.out.println("File " + fileName + " does not exist.");
>       return null;
>     }
>     try
>     {
>       System.out.println("Now defining the parser: new PDFParser ");
>       this.parser = new PDFParser(new FileInputStream(f));
>     }
>     catch (Exception e)
>     {
>       System.out.println("Unable to open PDF Parser.");
>       return null;
>     }
>     try
>     {
>       System.out.println("Now working with the parser: ");
>       this.parser.parse();
>       this.cosDoc = this.parser.getDocument();
>       this.pdfStripper = new PDFTextStripper();
>       this.pdDoc = new PDDocument(this.cosDoc);
>       this.parsedText = this.pdfStripper.getText(this.pdDoc);
>     }
>     catch (Exception e)
>     {
>       System.out.println("An exception occurred in parsing the PDF Document.");
>       e.printStackTrace();
>       try
>       {
>         if (this.cosDoc != null) {
>           this.cosDoc.close();
>         }
>         if (this.pdDoc != null) {
>           this.pdDoc.close();
>         }
>       }
>       catch (Exception e1)
>       {
>         // report the failure to close, not the outer exception again
>         e1.printStackTrace();
>       }
>       return null;
>     }
>     System.out.println("Done.");
>     return this.parsedText;
>   }
>
>   void writeTexttoFile(String pdfText, String fileName)
>   {
>     System.out.println("\nWriting PDF text to output text file " + fileName + "....");
>     try
>     {
>       PrintWriter pw = new PrintWriter(fileName);
>       pw.print(pdfText);
>       pw.close();
>     }
>     catch (Exception e)
>     {
>       System.out.println("An exception occurred in writing the pdf text to file.");
>       e.printStackTrace();
>     }
>     System.out.println("Done.");
>   }
>
>   public static void main(String[] args)
>   {
>     if (args.length != 2)
>     {
>       System.out.println("Usage: java PDFTextParser ");
>       System.exit(1);
>     }
>     System.out.println(" MAIN: Start, both files have been passed ");
>     System.out.println(" MAIN: PDF file (arg 0) : " + args[0]);
>     System.out.println(" MAIN: Text file (arg 1) : " + args[1]);
>     PDFTextParser pdfTextParserObj = new PDFTextParser();
>     String pdfToText = pdfTextParserObj.pdftoText(args[0]);
>     if (pdfToText == null)
>     {
>       System.out.println("PDF to Text Conversion failed.");
>     }
>     else
>     {
>       System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
>       pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
>     }
>   }
> }
>

You could simplify your code a lot by doing something similar to the
following (I haven't tested it, so there might be typos). The typical way to
parse a PDF document is to call PDDocument.load, which does the rest in the
background for you and returns the PDDocument you need for the
PDFTextStripper:

  void pdftoText(String pdfFile, String outputFile)
  {
    System.out.println("Parsing text from PDF file " + pdfFile + "....");
    File f = new File(pdfFile);
    if (!f.isFile())
    {
      System.out.println("File " + pdfFile + " does not exist.");
      // nothing to parse, so stop here
      return;
    }

    PDDocument pdDoc = null;
    Writer output = null;
    try
    {
      // load() parses the file and returns the PDDocument directly
      pdDoc = PDDocument.load(f);
      output = new OutputStreamWriter(new FileOutputStream(outputFile));
      PDFTextStripper pdfStripper = new PDFTextStripper();
      // stream the extracted text straight to the output file
      pdfStripper.writeText(pdDoc, output);
    }
    catch (IOException e)
    {
      System.out.println("An exception occurred in parsing the PDF Document.");
      e.printStackTrace();
    }
    finally
    {
      // always release the document and the writer
      IOUtils.closeQuietly(pdDoc);
      IOUtils.closeQuietly(output);
    }
    System.out.println("Done.");
  }

In addition, there is already a command-line app, ExtractText, which does
that for you.

>
> Could you please advise me how I can extract all the information from the
> PDF file, or at least the data from the middle of the page? I don't really
> need the text in the header and footer.
>
> I can send my PDF and TXT files if needed.
>

Regarding the PDF: could you upload it to a public location so we can give
it a try?

BR
Maruan

> Many thanks in advance!!!
>
> Best regards,
> Anna Yakubenko
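
A side note on the resource cleanup in the pdftoText and writeTexttoFile code above: on Java 7 or later, the close-in-finally pattern can also be written with try-with-resources. The following is only a minimal sketch using the JDK alone; the class name TextFileWriterSketch and the method writeText are made up for illustration and are not part of PDFBox. It assumes UTF-8 output, which the original code does not specify.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextFileWriterSketch {

    // Hypothetical helper: writes the given text to a file as UTF-8.
    // The writer is closed automatically when the try block exits,
    // even if write() throws, so no explicit finally block is needed.
    public static void writeText(String text, Path target) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Use the first argument as the output path, or a default name.
        Path target = Paths.get(args.length > 0 ? args[0] : "output.txt");
        writeText("text extracted from the PDF\n", target);
        System.out.println("Wrote " + target.toAbsolutePath());
    }
}
```

The same construct should work for any resource that implements java.io.Closeable, so whether it can wrap the PDDocument as well depends on the PDFBox version in use.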