pdfbox-users mailing list archives

From Jason Lewis <ja...@dickson.st>
Subject problem extracting text from PDFs created in Windows 10
Date Mon, 16 May 2016 12:26:54 GMT

I'm having a problem using PDFBox to extract text from PDFs.

I have an application that prints to a PDF printer device in Windows.
The PDF printer device is actually cups-pdf on a Linux server.

Under Windows 7 I had the same problem extracting text from PDFs that
were generated in this way: PDFBox seemed unable to read them.
Eventually I solved this by turning off "Enable advanced printing
features" in the Windows printer driver settings. After that, PDFBox was
able to extract the text perfectly.

In Windows 10, however, you can't turn this option off. From what I
gather, Windows 10 uses "type 4" printer drivers, and the "Enable
advanced printing features" option is ticked but greyed out, so you
can't untick it.

I have a test PDF that PDFBox can read fine, but if I print that PDF in
Windows to the CUPS PDF printer device, the resulting PDF is mangled in
some way that prevents PDFBox from parsing it.

Is there something I can do to make PDFBox understand the mangled PDF?

I've also noticed that I can't select text in the broken PDF. Maybe this
Windows driver somehow outlines all the text so it's no longer text.
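
One rough way to test that guess, without involving PDFBox at all, is to
count font and image markers in the raw bytes of the file. The sketch
below is only a heuristic under stated assumptions: the class name
PdfMarkerCount and its count helper are made up for illustration, and
the check assumes the object dictionaries are stored uncompressed. If
the file uses compressed object streams, the markers can be hidden, so a
count of zero is a hint rather than proof.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: scans raw PDF bytes for markers.
class PdfMarkerCount {

    // Counts non-overlapping occurrences of an ASCII token in the data.
    // Note that "/Font" also matches longer names such as "/FontDescriptor".
    static int count(byte[] data, String token) {
        byte[] t = token.getBytes(StandardCharsets.US_ASCII);
        int n = 0;
        for (int i = 0; i + t.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < t.length; j++) {
                if (data[i + j] != t[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                n++;
                i += t.length - 1; // skip past this occurrence
            }
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the PDF to inspect.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("/Font markers:  " + count(data, "/Font"));
        System.out.println("/Image markers: " + count(data, "/Image"));
    }
}
```

On Java 11 or later this can be run directly as
"java PdfMarkerCount.java test-pdf-broken.pdf". No font markers next to
one or more image markers would be consistent with the text having been
turned into graphics, in which case no text extractor could recover it
without OCR.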

I'm using PDFBox like this:

java -jar pdfbox-app-2.0.1.jar ExtractText -encoding UTF-8 -console -startPage 1 -endPage 1 test-pdf-broken.pdf

Link to working PDF:

Link to broken PDF:

Any suggestions on how I might fix this?


Jason Lewis
