From: jorgeeflorez
Date: Fri, 2 Nov 2018 17:37:22 -0500
Subject: Extracting page "correctly"
To: users@pdfbox.apache.org

Hi all,

I want to extract the text from the page of this PDF file. I am using the following code to achieve it:

try (PDDocument document = PDDocument.load(new File(fileName)))
{
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition( false );
    stripper.setStartPage( 0 );
    stripper.setEndPage( document.getNumberOfPages() );

    System.out.println(stripper.getText(document));
}

The result I get (part of it) is:

----------------
A
 S
am
pl
e
P
os
te
r
La
nd
sc
ap
e
La
yo
ut
----------------
If I use stripper.setSortByPosition( true ) I get the following (part of it):

----------------
A Sample Poster  Landscape Layout - Title
Name of Researcher(s)
Name of Department
Introduction Measurable Outcomes
The Mechanical Engineering Department at WPI was established in 1868 and the first
undergraduate degrees were awarded in 1871. The Department *currently has about 450 Graduating students* should demonstrate the following at a level equivalent to an entry-
undergraduate students and 100 graduate students. Housed in the Higgins Laboratory and the level engineer or first year graduate student:
Washburn shops the faculty consists of 29 tenured and tenure track professors, and several
non-tenure track teaching staff. The Department offers undergraduate and graduate degrees in a. An understanding of the fundamental principles of conservation laws,
----------------

The text I get is better than the first one, but it mixes the text from the left and right "columns" (please see the bold text).
My question is: is it possible to get the text as one would naturally read it, i.e. the text of the left column and then the text of the right column?
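
To make the desired ordering concrete, here is a minimal sketch in plain Java with no PDFBox calls. It assumes each piece of text is known together with the top-left corner of its bounding box, and that the two columns are separated at a known x coordinate. The class ColumnReadingOrder, the Fragment type, the splitX parameter and the coordinates in main are all hypothetical and only illustrate the idea. Sorting by position alone interleaves lines that share the same y value, which matches the mixed output above; assigning each fragment to a column first and sorting within each column gives the left column followed by the right column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of column-aware ordering; not part of PDFBox. */
final class ColumnReadingOrder {

    /** A piece of text plus the top-left corner of its box (y grows downward). */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Orders fragments for a two-column page: everything left of splitX from
     * top to bottom, then everything at or right of splitX from top to bottom.
     */
    static List<String> readingOrder(List<Fragment> fragments, float splitX) {
        List<Fragment> left = new ArrayList<>();
        List<Fragment> right = new ArrayList<>();
        for (Fragment f : fragments) {
            (f.x < splitX ? left : right).add(f);
        }

        // Within one column, sort by vertical position, then horizontal.
        Comparator<Fragment> topToBottom =
                Comparator.<Fragment>comparingDouble(f -> f.y)
                          .thenComparingDouble(f -> f.x);
        left.sort(topToBottom);
        right.sort(topToBottom);

        List<String> ordered = new ArrayList<>();
        for (Fragment f : left) {
            ordered.add(f.text);
        }
        for (Fragment f : right) {
            ordered.add(f.text);
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Coordinates are invented for illustration, not measured from the poster.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("Introduction", 50f, 100f),
                new Fragment("Measurable Outcomes", 450f, 100f),
                new Fragment("The Mechanical Engineering Department at WPI ...", 50f, 120f),
                new Fragment("Graduating students should demonstrate ...", 450f, 120f));

        for (String line : readingOrder(fragments, 400f)) {
            System.out.println(line);
        }
    }
}
```

With these made-up coordinates the two left-column lines are printed first, followed by the two right-column lines. In PDFBox itself, PDFTextStripperByArea extracts text from rectangles supplied by the caller, so it may serve the same purpose when the column bounds are known in advance; whether that fits this particular file is untested here.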

I attached the file, just in case the link cannot be opened.
Thanks in advance.
Best Regards.
Jorge Eduardo Flórez