Subject: Re: Problem extracting and processing text from a PDF
From: Maruan Sahyoun
To: users@pdfbox.apache.org
Date: Wed, 5 Apr 2017 22:02:36 +0200

Hi,

> On 05.04.2017 at 21:46, David Patterson wrote:
>
> Hello,
>
> I’m trying to extract the text from a PDF that was saved from a Word
> document.
>
> I am using Release 2.0.5 of pdfbox and pdfbox-tools, with Java 8 on a
> Windows machine.
>
> I’m using the following code to get the text:
>
> PDDocument pdDocument = PDDocument.load( pdfFile );
> PDFTextStripper stripper = new PDFTextStripper();
> String rawText = stripper.getText( pdDocument );
> // end of code excerpt
>
> I’m running the same code on a collection of files. Most work as expected.
> I can see the following in the text of the Table of Contents:
>
> 5.15.1 ADDENDA ............................................ 1
> 5.15.2 YOU ARE HERE ....................................... 2
> 5.15.3 INTRODUCTION ....................................... 4
>
> However, for two files, what I see is:
>
> 5.16 xxx SYSTEM PROCEDURES ................................ 1
> ADDENDA ................................................... 1 5.16.1
> YOU ARE HERE .............................................. 2 5.16.2
> INTRODUCTION .............................................. 4 5.16.3
>
> Note: the outline numbers (5.16.1, etc.) are at the end of the line, not at
> the beginning.
>
> A) Is this a known, solvable problem?
> B) If not, is there a different way I can try to extract the data?
> C) If not, can I help debug/diagnose the problem? I cannot send the
> offending PDF file out of my system.

Try calling setSortByPosition(true) on your PDFTextStripper before getText():

https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition(boolean)

BR
Maruan

> Thanks
>
> Dave Patterson
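
For anyone wondering why sorting by position helps: a PDF writer is free to emit the text of a line in any order, and by default the stripper follows the order in the content stream rather than the order on the page. The sketch below uses only the Java standard library to illustrate the idea. It is not PDFBox's implementation, `Fragment` is a hypothetical stand-in type, and the coordinates are invented sample values under the assumption that the outline number was written after the rest of its line. In PDFBox itself the change is a single call, `stripper.setSortByPosition(true)`, made before `stripper.getText(pdDocument)`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    /** Hypothetical stand-in for one positioned run of text (not a PDFBox type). */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the given order, starting a new line whenever y changes. */
    static String render(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : fragments) {
            if (previous != null) {
                out.append(previous.y == f.y ? ' ' : '\n');
            }
            out.append(f.text);
            previous = f;
        }
        return out.toString();
    }

    /** Returns a copy ordered top-to-bottom, then left-to-right. */
    static List<Fragment> sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                            .thenComparingDouble(f -> f.x));
        return copy;
    }

    /** Invented sample: each outline number is written last but placed leftmost. */
    static List<Fragment> sampleStreamOrder() {
        List<Fragment> list = new ArrayList<>();
        list.add(new Fragment("ADDENDA", 120, 100));
        list.add(new Fragment("1", 500, 100));
        list.add(new Fragment("5.16.1", 72, 100));
        list.add(new Fragment("YOU ARE HERE", 120, 115));
        list.add(new Fragment("2", 500, 115));
        list.add(new Fragment("5.16.2", 72, 115));
        return list;
    }

    public static void main(String[] args) {
        List<Fragment> stream = sampleStreamOrder();
        System.out.println(render(stream));                   // content-stream order
        System.out.println("--");
        System.out.println(render(sortedByPosition(stream))); // visual order
    }
}
```

In stream order the outline numbers trail each line, as in the report above; after sorting they lead it. Whether this is what actually happens in the two problem files can only be confirmed by trying the setting on them.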