pdfbox-users mailing list archives

From jorgeeflorez <jorgeeduardoflo...@gmail.com>
Subject Text extraction example
Date Mon, 12 Nov 2018 18:56:42 GMT
Hi all,

First, I want to thank Tilman for his effort in getting the text from a page
regardless of its rotation
(https://issues.apache.org/jira/browse/PDFBOX-4371).

Second, I want to share with you a small application I created in C#. It
uses the iTextSharp library and a custom text extraction strategy to get the
text.

Application: here
<https://drive.google.com/file/d/1CmKvkib_ONTytwaoIrrmMdVyICXO1IPd/view?usp=sharing>
Class that processes the text: here
<https://drive.google.com/file/d/1u3VykdQR8Eh9ooRiqxc4q2_20w3lw8gw/view?usp=sharing>
Sample PDF files: here
<https://drive.google.com/file/d/1KdpQEIEbIl5ZETq33C2X8JVM5qfMXlDg/view?usp=sharing>

I have been trying to port the code to Java and make it work with PDFBox
objects, but so far I have not managed to.

Basically, the magic happens in the RenderText method (based on other code I
found on a web page that I no longer remember :( ). It uses vectors (the
origin is the lower left corner of the page) to decide things such as whether
there is a line break or whether a space must be inserted between glyphs.
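
To give an idea of the vector approach in Java, here is a minimal sketch. It
is not the RenderText code from the linked class, and it does not use any
PDFBox or iTextSharp types. The class name ChunkGeometry, its method names,
and the tolerance and half-space thresholds are all assumptions made for
illustration. Each text chunk is reduced to the start and end points of its
baseline, in page coordinates.

```java
/**
 * Hypothetical helper (not part of PDFBox or iTextSharp).
 * (ax, ay) and (bx, by) are the start and end of the previous chunk's
 * baseline; (px, py) is the start of the next chunk's baseline.
 */
final class ChunkGeometry {
    private ChunkGeometry() {}

    /** Perpendicular distance from (px, py) to the line through the previous baseline. */
    static double distanceFromBaseline(double ax, double ay, double bx, double by,
                                       double px, double py) {
        double dx = bx - ax;
        double dy = by - ay;
        double length = Math.hypot(dx, dy);
        if (length == 0.0) {
            // Degenerate baseline: fall back to point-to-point distance.
            return Math.hypot(px - ax, py - ay);
        }
        // |cross product| divided by the baseline length.
        return Math.abs(dx * (py - ay) - dy * (px - ax)) / length;
    }

    /** Assumed rule: the next chunk starts a new line if it is off the baseline by more than tolerance. */
    static boolean isLineBreak(double ax, double ay, double bx, double by,
                               double px, double py, double tolerance) {
        return distanceFromBaseline(ax, ay, bx, by, px, py) > tolerance;
    }

    /** Signed gap from the previous chunk's end to (px, py), measured along the writing direction. */
    static double gapAlongBaseline(double ax, double ay, double bx, double by,
                                   double px, double py) {
        double dx = bx - ax;
        double dy = by - ay;
        double length = Math.hypot(dx, dy);
        if (length == 0.0) {
            return 0.0;
        }
        // Dot product of (p - b) with the unit baseline direction.
        return ((px - bx) * dx + (py - by) * dy) / length;
    }

    /** Assumed rule: insert a space if the gap exceeds half the width of a space glyph. */
    static boolean needsSpace(double ax, double ay, double bx, double by,
                              double px, double py, double spaceWidth) {
        return gapAlongBaseline(ax, ay, bx, by, px, py) > spaceWidth / 2.0;
    }
}
```

Because both tests are expressed relative to the previous chunk's baseline
vector, the same checks apply to text rotated by 90 degrees or any other
angle. With PDFBox, the points would have to be derived from each glyph's
text matrix, which is the part I have not worked out.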

I just hope this code sheds some light on how to adjust or improve (if you
consider it necessary) text extraction.

That's it.

Thank you.
Best Regards.

Jorge Eduardo Flórez
