pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: extracting text from an "encrypted" pdf
Date Fri, 08 May 2015 15:52:44 GMT
On 08.05.2015 at 17:51, Clemens Wyss DEV wrote:
> Thx for the very fast answer.
>> new StandardDecryptionMaterial( password );
> I have no password. The pdf is a public user manual.

Use an empty password :-)
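
A minimal sketch of that suggestion, assuming the PDFBox 1.8 API discussed in this thread (PDDocument.load, openProtection, and PDFTextStripper from org.apache.pdfbox.util). The class name EmptyPasswordExtract and the helper effectivePassword are hypothetical names introduced only for illustration; the idea is simply that a PDF which opens in a viewer without prompting is usually encrypted with an empty user password.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.util.PDFTextStripper;

public class EmptyPasswordExtract {

    // Hypothetical helper: treat "no password" as the empty user password.
    public static String effectivePassword(String supplied) {
        return supplied == null ? "" : supplied;
    }

    // Loads the PDF, decrypts it if needed, and returns the extracted text.
    public static String extractText(InputStream is, String password) throws Exception {
        PDDocument document = PDDocument.load(is);
        try {
            if (document.isEncrypted()) {
                // An empty string is enough when only an owner password was set.
                StandardDecryptionMaterial sdm =
                        new StandardDecryptionMaterial(effectivePassword(password));
                document.openProtection(sdm);
            }
            return new PDFTextStripper().getText(document);
        } finally {
            document.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the PDF; no password is supplied.
        InputStream is = new FileInputStream(args[0]);
        try {
            System.out.println(extractText(is, null));
        } finally {
            is.close();
        }
    }
}
```

If the document really does require a non-empty user password, openProtection is expected to fail with an exception rather than return empty text, so the two cases are easy to tell apart.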

Tilman

>
>> That is TIKA, isn't it?
> True
>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Friday, 8 May 2015 17:44
> To: users@pdfbox.apache.org
> Subject: Re: extracting text from an "encrypted" pdf
>
> On 08.05.2015 at 17:36, Clemens Wyss DEV wrote:
>> When I try to extract an "encrypted" (which can be read in AcrobatReader) document with:
>>
>> pdfDocument = PDDocument.load( is );
> add
> if( document.isEncrypted() )
> {
>    StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
>    document.openProtection( sdm );
> }
>
> or use loadNonSeq()
>
>> PDFTextStripper pdfStripper = new PDFTextStripper();
>> parsedText = pdfStripper.getText( pdfDocument );
>>
>> I get an empty string, and "o.apache.pdfbox.pdfparser.PDFParser - Document is encrypted" is logged.
>>
>> When, on the other hand, I do:
>>
>> ContentHandler handler = new BodyContentHandler( -1 );
>> ParseContext context = new ParseContext();
>> parser = new AutoDetectParser();
>> context.set( Parser.class, parser );
>> parser.parse( is, handler, metadata, context );
>> parsedText = handler.toString();
>>
>> I do get the text content of that same PDF.
>>
>> 1) What is the preferred way to extract text from a PDF that can be read in AcrobatReader?
> https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java?view=markup&sortby=date
>
>>    
>> 2) Does the second approach possibly return "more than text"? Blobs? Binary data?
> That is TIKA, isn't it?
>
> Tilman
>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>



