pdfbox-users mailing list archives

From jorgeeflorez <jorgeeduardoflo...@gmail.com>
Subject Extracting page "correctly"
Date Fri, 02 Nov 2018 22:37:22 GMT
Hi all,
I want to extract the text from the page of this PDF file
<https://drive.google.com/file/d/1RMBmU2XTaSgQVDkU2eYECP8fe2SjVqFp/view?usp=sharing>.
I am using the following code to achieve it:

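import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

try (PDDocument document = PDDocument.load(new File(fileName)))
{
    PDFTextStripper stripper = new PDFTextStripper();
    // Emit text in content-stream order rather than by position on the page.
    stripper.setSortByPosition( false );
    // Page numbers are 1-based.
    stripper.setStartPage( 1 );
    stripper.setEndPage( document.getNumberOfPages() );

    System.out.println(stripper.getText(document));
}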

The result I get (part of it) is:

----------------
A
 S
am
pl
e
P
os
te
r
La
nd
sc
ap
e
La
yo
ut
----------------

If I use stripper.setSortByPosition( true ), I get the following (part of
it):

----------------
A Sample Poster  Landscape
Layout - Title
Name of Researcher(s)
Name of Department
Introduction Measurable Outcomes
The Mechanical Engineering Department at WPI was established in 1868 and
the first
undergraduate degrees were awarded in 1871. The Department *currently has
about 450 Graduating students* should demonstrate the following at a level
equivalent to an entry-
undergraduate students and 100 graduate students. Housed in the Higgins
Laboratory and the level engineer or first year graduate student:
Washburn shops the faculty consists of 29 tenured and tenure track
professors, and several
non-tenure track teaching staff. The Department offers undergraduate and
graduate degrees in a. An understanding of the fundamental principles of
conservation laws,
----------------

The text I get is better than the first one, but it mixes the text from the
left and right "columns" (please see the text between asterisks above).
My question is: is it possible to get the text in the order one would
naturally read it, i.e. all of the left column and then all of the right column?
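
One possible direction, sketched here under assumptions rather than as a confirmed solution, is to extract each column as its own rectangular region. The helper below is hypothetical (the class name ColumnRegions and the method split are not part of PDFBox). It assumes the columns have equal width and sit below a header band of known height; neither assumption has been checked against the attached poster.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a page into equal-width column rectangles. */
class ColumnRegions {

    /**
     * Returns one rectangle per column, ordered left to right.
     * Assumption: columns are equally wide and start below a header band
     * of height headerHeight, with y measured downward from the top edge.
     */
    static List<Rectangle2D> split(double pageWidth, double pageHeight,
                                   int columns, double headerHeight) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        if (headerHeight < 0 || headerHeight > pageHeight) {
            throw new IllegalArgumentException("headerHeight out of range");
        }
        double columnWidth = pageWidth / columns;
        double bodyHeight = pageHeight - headerHeight;
        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            regions.add(new Rectangle2D.Double(
                    i * columnWidth, headerHeight, columnWidth, bodyHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Illustrative page size in points; the real size would come from the PDF.
        for (Rectangle2D region : split(792, 612, 2, 100)) {
            System.out.println(region);
        }
    }
}
```

Each rectangle could then be registered with PDFTextStripperByArea.addRegion(name, rectangle), followed by extractRegions(page) and getTextForRegion(name) for the left column and then the right. The region coordinates are, as far as I understand, measured from the top-left corner of the page, so the header height and column count would need to be tuned to the actual layout.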

I attached the file, just in case the link cannot be opened.
Thanks in advance.
Best Regards.
Jorge Eduardo Flórez
