pdfbox-users mailing list archives

From Clemens Wyss DEV <clemens...@mysign.ch>
Subject extracting text from an "encrypted" pdf
Date Fri, 08 May 2015 15:36:24 GMT
When I try to extract text from an "encrypted" document (one that can be read in Acrobat Reader) with:

// is: an InputStream over the PDF
PDDocument pdfDocument = PDDocument.load( is );
PDFTextStripper pdfStripper = new PDFTextStripper();
String parsedText = pdfStripper.getText( pdfDocument );

I get an empty string, and "o.apache.pdfbox.pdfparser.PDFParser - Document is encrypted"
is logged.

When, on the other hand, I do:

// Tika: -1 disables the write limit of BodyContentHandler
ContentHandler handler = new BodyContentHandler( -1 );
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
context.set( Parser.class, parser );
parser.parse( is, handler, metadata, context );
String parsedText = handler.toString();

I do get the text content of that very PDF.

1) What is the preferred way to extract text from a PDF ("that-can-be-read-in-AcrobatReader")?

2) Does the second approach possibly return "more than text"? Blobs? Binary data?
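
On question 2, one rough way to inspect a result for non-text content is to count control characters in the extracted string. This is only a sketch in plain Java with no PDFBox or Tika calls; the class name ExtractedTextCheck, both method names and the ratio threshold are hypothetical, and the heuristic says nothing about what the parser actually emits:

```java
// Hypothetical helper: a heuristic check, not part of PDFBox or Tika.
final class ExtractedTextCheck {
    private ExtractedTextCheck() {}

    /**
     * Counts code points that rarely occur in plain text: ISO control
     * characters other than newline, carriage return, tab and form feed,
     * plus the Unicode replacement character U+FFFD.
     */
    static int countSuspiciousChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean ordinaryWhitespace =
                    cp == '\n' || cp == '\r' || cp == '\t' || cp == '\f';
            if (cp == 0xFFFD || (Character.isISOControl(cp) && !ordinaryWhitespace)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true when the share of suspicious code points is at most
     * maxRatio (an assumed threshold, e.g. 0.01). Empty input counts as text.
     */
    static boolean looksLikePlainText(String text, double maxRatio) {
        if (text.isEmpty()) {
            return true;
        }
        int total = text.codePointCount(0, text.length());
        return (double) countSuspiciousChars(text) / total <= maxRatio;
    }
}
```

Calling ExtractedTextCheck.looksLikePlainText( parsedText, 0.01 ) on the string from either approach would flag output that is mostly control bytes. It cannot tell whether the remaining characters are meaningful text.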