pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: problem extracting text from PDFs created in Windows 10
Date Mon, 16 May 2016 12:54:15 GMT
On 16.05.2016 at 14:26, Jason Lewis wrote:
> Hi,
> I'm having a problem using PDFBox to extract text from PDFs.
> I have an application that prints to a PDF printer device in Windows.
> The PDF printer device is actually cups-pdf on a Linux server.
> Under Windows 7 I had the same problem extracting text from PDFs that
> were generated in this way; they seemed unreadable by PDFBox. Eventually
> I solved this by turning off the "Enable advanced printing features" in
> the Windows printer driver settings. After that PDFBox was able to
> extract the text perfectly.
> In Windows 10, however, you can't turn this option off. From what I gather,
> Windows 10 uses "type 4" printer drivers, and the "Enable advanced
> printing features" option is ticked but greyed out, so you can't un-tick it.
> I have a test PDF that PDFBox can read fine, but if I print that PDF in
> Windows to the CUPS PDF printer device, the resulting PDF is mangled in
> some way that prevents PDFBox from parsing it.

Why would you do that? You already have a PDF. Or was that just to
illustrate the problem, i.e. you're really printing from that application
of yours and getting the same result, but you don't want to share that
output because it is confidential?

> Is there something I can do to make PDFBox be able to understand the
> mangled PDF?


> I've also noticed that I can't select text in the broken PDF. Maybe this
> Windows driver somehow outlines all the text so it's no longer text but
> vectors?

I had a look at the "printed" PDF with PDFDebugger. It contains the text as
one huge image, not as text, so there is nothing for ExtractText to extract.
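
A rough way to double-check this kind of diagnosis without any PDF library is to scan the file's raw bytes for image and font dictionaries. The sketch below is only a heuristic and is not what PDFDebugger does. The class PdfMarkerScan and its methods are made up for illustration, and it assumes the dictionaries are stored uncompressed (objects packed into PDF 1.5+ object streams would be missed). A file that has image objects but no font objects most likely has no real text on its pages.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class PdfMarkerScan {
    // Image XObject dictionaries contain "/Subtype /Image" (whitespace optional).
    private static final Pattern IMAGE = Pattern.compile("/Subtype\\s*/Image\\b");
    // Font dictionaries contain "/Type /Font"; the \b keeps "/FontDescriptor" out.
    private static final Pattern FONT = Pattern.compile("/Type\\s*/Font\\b");

    private static int count(Pattern pattern, byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data survives.
        String raw = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher matcher = pattern.matcher(raw);
        int hits = 0;
        while (matcher.find()) {
            hits++;
        }
        return hits;
    }

    static int countImages(byte[] pdf) {
        return count(IMAGE, pdf);
    }

    static int countFonts(byte[] pdf) {
        return count(FONT, pdf);
    }

    // Heuristic: images present and no fonts suggests a page that is only a picture.
    static boolean looksImageOnly(byte[] pdf) {
        return countImages(pdf) > 0 && countFonts(pdf) == 0;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfMarkerScan.java <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("image objects: " + countImages(pdf));
        System.out.println("font objects:  " + countFonts(pdf));
        System.out.println("looks image-only: " + looksImageOnly(pdf));
    }
}
```

With Java 11 or later this can be run directly as `java PdfMarkerScan.java test-pdf-broken.pdf`. If it reports an image-only file, text extraction will not help and the page would need OCR instead.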

try this:


> I'm using PDFBox like this:
> java -jar pdfbox-app-2.0.1.jar ExtractText -encoding UTF-8 -console
> -startPage 1 -endPage 1 test-pdf-broken.pdf
> Link to working PDF:
> https://www.dropbox.com/s/glcmhl7nkg8w45f/test-pdf-works.pdf?dl=0
> Link to broken PDF:
> https://www.dropbox.com/s/uriq36brougr4z1/test-pdf-broken.pdf?dl=0
> Any suggestions on how I might fix this?
> Thanks,
> Jason
