pdfbox-users mailing list archives

From: Maruan Sahyoun <sahy...@fileaffairs.de>
Subject: Re: Make PDFBox fail on bad pdf
Date: Thu, 30 Mar 2017 12:29:22 GMT
Hi,

> On 30.03.2017 at 14:25, Wouter De Borger <wouter.deborger@inmanta.com> wrote:
> 
> Hi,
> 
> Thanks for the hint! I'll try to add some content there, as I can
> definitely use a garbage detector.
> 
> In this case, however, I was specifically trying to avoid using a
> statistical detector. PDFBox already knows there is a problem,

That is not the case here. From PDFBox's perspective everything is fine: it is extracting the
text according to the definitions and information in the PDF. Recognizing that the result is
garbage from a user's perspective would require PDFBox to 'understand' that the extracted text
is not meaningful.
BR
Maruan 

> so there is
> no need to examine the content to attempt to detect a problem.
> I would like to be able to capture the problem when and where it is known,
> as this is easier and more accurate.
> 
> Thanks,
> Wouter
> 
> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <tallison@mitre.org>
> wrote:
> 
>> If you have any recommendations for the more general case, let us know on
>> TIKA-1443 [1].
>> 
>> [1] https://issues.apache.org/jira/browse/TIKA-1443
>> 
>> -----Original Message-----
>> From: Wouter De Borger [mailto:wouter.deborger@inmanta.com]
>> Sent: Thursday, March 30, 2017 6:00 AM
>> To: users@pdfbox.apache.org
>> Subject: Make PDFBox fail on bad pdf
>> 
>> Hi All,
>> 
>> When a PDF has a bad encoding, PDFBox produces garbage (as explained in the
>> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>> 
>> Can I make PDFBox fail in this case instead of producing garbage?
>> 
>> (I'm working on a system that can also do OCR, so at the slightest sign of
>> trouble, I would like PDFBox to fail and try OCR.)
>> 
>> Thanks,
>> Wouter
>> 
> 
> 
> 
> -- 
> Wouter De Borger, PhD
> Co-founder Inmanta
> www.inmanta.com
> Email: wouter.deborger@inmanta.com
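
Since, as the reply above explains, PDFBox has no notion of whether the extracted text is meaningful, any "fail and fall back to OCR" decision has to be made on the extracted string itself. The following is only a minimal sketch of such a check, not something PDFBox provides: the class name GibberishCheck, its methods, and the 0.1 threshold in the usage note are hypothetical. It flags only a narrow class of problems (replacement characters, control characters, private-use and unassigned code points) and will not catch text that decodes to valid but wrong letters.

```java
// Hypothetical helper, not part of PDFBox: a crude post-extraction sanity check.
final class GibberishCheck {
    private GibberishCheck() {}

    /**
     * Returns the fraction of non-whitespace code points that look suspicious:
     * U+FFFD, ISO control characters, private-use or unassigned code points.
     * Text with no non-whitespace content yields 1.0, on the assumption that
     * an empty extraction should also be routed to OCR.
     */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces, tabs and line breaks are not evidence either way
            }
            total++;
            boolean bad = cp == 0xFFFD
                    || Character.isISOControl(cp)
                    || Character.getType(cp) == Character.PRIVATE_USE
                    || !Character.isDefined(cp);
            if (bad) {
                suspicious++;
            }
        }
        return total == 0 ? 1.0 : (double) suspicious / total;
    }

    /** True when the suspicious fraction exceeds the caller-chosen threshold. */
    static boolean looksLikeGibberish(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

In a pipeline like the one described in the original question, the input string would typically come from `new PDFTextStripper().getText(document)`, and a call such as `GibberishCheck.looksLikeGibberish(text, 0.1)` returning true would send the file to OCR. The threshold is an arbitrary starting point and would need tuning against real documents.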



