pdfbox-users mailing list archives

From Jason Lewis <ja...@dickson.st>
Subject Re: problem extracting text from PDFs created in Windows 10
Date Mon, 16 May 2016 13:23:23 GMT
Hi Tilman,

On 16/05/2016 10:54 PM, Tilman Hausherr wrote:

>> I have a test PDF that PDFBox can read fine, but if I print that PDF in
>> Windows to the CUPS PDF printer device, the resulting PDF is mangled in
>> some way that prevents PDFBox from parsing it.
> Why would you do that? You already have a PDF. Or was it just to
> explain, i.e. you're really printing from that application of yours,
> with the same problem, but you don't want to show that output because it
> is confidential?

It's just to explain the problem. We have an application (that is
supplied to us as is and that I can't modify) that prints to a PDF
device in Windows as its way of making PDF reports.

Either way, I can extract text from my PDF or from a PDF generated by the
application on a version of Windows prior to Windows 10. I'd like to do
the same with PDFs generated on Windows 10.

I've been reading more on it, and it's definitely to do with the
Microsoft Type 4 drivers. I'll see if I can find a way to install a Type
3 driver and test with that.

>> Is there something I can do to make PDFBox be able to understand the
>> mangled PDF?
> No....
>> I've also noticed that I can't select text in the broken PDF. Maybe this
>> Windows driver somehow outlines all the text so it's no longer text but
>> vectors?
> I had a look at the "printed" PDF with PDFDebugger. It has the text as a
> huge image, not as text.

Thanks for confirming this. That's what I suspected.
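
Since the text ends up as one big image, a text extractor just hands back
little or nothing for these files. A rough way to flag them up front might
look like the sketch below. It is only a heuristic under my own
assumptions: the class name, method name and the threshold are made up for
illustration and are not part of PDFBox. The text passed in would be
whatever the extractor returned (with PDFBox 2.x, for example, the result
of PDFTextStripper.getText() on the loaded document), and the page count
would come from the document itself.

```java
// Hypothetical helper, not part of PDFBox. It only inspects text that has
// already been extracted, so it works with any extractor.
final class ImageOnlyCheck {

    // Assumed threshold: fewer than this many non-whitespace characters per
    // page is treated as "no usable text layer". Tune it for your reports.
    static final int MIN_CHARS_PER_PAGE = 10;

    /**
     * @param extractedText text the extractor returned for the whole document
     * @param pageCount     number of pages in the document (must be positive)
     * @return true if the document probably holds images instead of text
     */
    static boolean looksImageOnly(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        long visible = 0;
        if (extractedText != null) {
            // Count only characters a reader would actually see.
            visible = extractedText.chars()
                    .filter(c -> !Character.isWhitespace(c))
                    .count();
        }
        return visible < (long) MIN_CHARS_PER_PAGE * pageCount;
    }

    public static void main(String[] args) {
        // A two-page document that yielded only line breaks.
        System.out.println(looksImageOnly("\n\n", 2));
        // A one-page document with a normal line of text.
        System.out.println(looksImageOnly("Monthly report: 42 items shipped", 1));
    }
}
```

Files flagged this way would need OCR, or a printer driver that keeps the
text as text, before any extraction can work.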

> try this:
> http://techspeeder.com/2014/03/06/how-to-fix-printer-properties-that-are-grayed-out/
> http://www.networksteve.com/forum/topic.php/Administrator_cannot_change_printer_properties_on_%22Advanced%22_tab/?TopicId=57069&Posts=4

Yes, for some reason, the Type 4 drivers show that option but it is
greyed out and cannot be unticked.



Jason Lewis
