From users-return-11287-archive-asf-public=cust-asf.ponee.io@pdfbox.apache.org  Fri Nov  2 23:37:51 2018
MIME-Version: 1.0
From: jorgeeflorez <jorgeeduardoflorez@gmail.com>
Date: Fri, 2 Nov 2018 17:37:22 -0500
Message-ID: <CAAbeTbdZhS5LSYhSFGTRgAX2EL7zze7st94eCWskHatvG8j=pA@mail.gmail.com>
Subject: Extracting page "correctly"
To: users@pdfbox.apache.org
Content-Type: multipart/mixed; boundary="0000000000002e3f4e0579b62f2c"

--0000000000002e3f4e0579b62f2c
Content-Type: multipart/alternative; boundary="0000000000002e3f4a0579b62f2a"

--0000000000002e3f4a0579b62f2a
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi all,
I want to extract the text from the page of this PDF file
<https://drive.google.com/file/d/1RMBmU2XTaSgQVDkU2eYECP8fe2SjVqFp/view?usp=sharing>.
I am using the following code to achieve it:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

try (PDDocument document = PDDocument.load(new File(fileName)))
{
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition( false );
    stripper.setStartPage( 1 ); // page numbers are 1-based
    stripper.setEndPage( document.getNumberOfPages() );

    System.out.println(stripper.getText(document));
}

The result I get (part of it) is:

----------------
A
 S
am
pl
e
P
os
te
r
La
nd
sc
ap
e
La
yo
ut
----------------

If I use stripper.setSortByPosition( true ), I get the following (part of
it):

----------------
A Sample Poster  Landscape
Layout - Title
Name of Researcher(s)
Name of Department
Introduction Measurable Outcomes
The Mechanical Engineering Department at WPI was established in 1868 and the first
undergraduate degrees were awarded in 1871. The Department *currently has about 450 Graduating students* should demonstrate the following at a level equivalent to an entry-
undergraduate students and 100 graduate students. Housed in the Higgins Laboratory and the level engineer or first year graduate student:
Washburn shops the faculty consists of 29 tenured and tenure track professors, and several
non-tenure track teaching staff. The Department offers undergraduate and graduate degrees in a. An understanding of the fundamental principles of conservation laws,
----------------

This output is better than the first one, but each line mixes text from the
left and right "columns" (see the text marked with asterisks).
My question is: is it possible to get the text as one would naturally read
it, i.e. the text of the left column and then the text of the right column?
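
To make the ordering I am after concrete, here is a minimal sketch that
does not use PDFBox at all. The Fragment class, the coordinates and the
column boundary are hypothetical stand-ins for positioned text on a page;
the point is only to contrast a plain position sort (top to bottom, then
left to right) with a column-aware order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ColumnOrderSketch {

    /** Hypothetical positioned text fragment (not a PDFBox class). */
    static final class Fragment {
        final double x;    // left edge of the fragment
        final double y;    // distance from the top of the page
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Plain position sort: top to bottom, then left to right. */
    static String sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return join(sorted);
    }

    /** Column-aware order: all fragments left of the boundary, then the rest. */
    static String sortByColumn(List<Fragment> fragments, double columnBoundary) {
        List<Fragment> left = new ArrayList<>();
        List<Fragment> right = new ArrayList<>();
        for (Fragment f : fragments) {
            (f.x < columnBoundary ? left : right).add(f);
        }
        Comparator<Fragment> topToBottom = Comparator.comparingDouble(f -> f.y);
        left.sort(topToBottom);
        right.sort(topToBottom);
        left.addAll(right);
        return join(left);
    }

    /** Joins fragment texts with single spaces, in list order. */
    private static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: two lines in each of two columns.
        List<Fragment> page = Arrays.asList(
                new Fragment(400, 100, "right-1"),
                new Fragment(50, 100, "left-1"),
                new Fragment(400, 112, "right-2"),
                new Fragment(50, 112, "left-2"));

        System.out.println(sortByPosition(page));    // left-1 right-1 left-2 right-2
        System.out.println(sortByColumn(page, 300)); // left-1 left-2 right-1 right-2
    }
}
```

The first line printed corresponds to the interleaving shown above, the
second to the order I would like. One possible route in PDFBox may be
PDFTextStripperByArea with one rectangular region per column, although
the column boundary would have to be known or guessed for each page.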

I attached the file, just in case the link cannot be opened.
Thanks in advance.
Best Regards.
Jorge Eduardo Flórez


--0000000000002e3f4a0579b62f2a--

--0000000000002e3f4e0579b62f2c--