pdfbox-users mailing list archives

From Andreas Lehmkühler <andr...@lehmi.de>
Subject Re: extracting embedded documents -- will getEmbeddedFile() alone miss embedded DOS/Unix/Mac files?
Date Thu, 24 Jul 2014 10:46:01 GMT

> "Allison, Timothy B." <tallison@mitre.org> wrote on 23 July 2014 at 20:21:
> All,
>   Over on Tika, it looks like we copied
> org.apache.pdfbox.examples.pdmodel.ExtractEmbeddedFiles to extract embedded
> files.  As I look at the source code for PDComplexFileSpecification, I notice
> that getEmbeddedFile() does not behave like getFilename(); that is, it doesn't
> iterate through the various formats and return the first non-null one.
>   When we try to get the PDEmbeddedFile, should we try each of these instead
> of just getEmbeddedFile()?
>
> getEmbeddedFile()
> getEmbeddedFileDos()
> getEmbeddedFileUnix()
> getEmbeddedFileMac()
>
>   Will getEmbeddedFile() alone potentially miss embedded files?

Yes. "getFilename()" was added as a convenience. There is no equivalent method
for the embedded file, so you have to check each variant yourself.
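
A minimal sketch of that manual lookup, assuming Java 8 or later. The class
EmbeddedFileLookup and its firstNonNull helper are hypothetical names for
illustration and are not part of PDFBox; only the four getters named in the
question are taken from PDComplexFileSpecification.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of PDFBox): returns the first non-null
// result from a list of candidate getters, tried in the order given.
final class EmbeddedFileLookup {
    private EmbeddedFileLookup() {
    }

    @SafeVarargs
    static <T> T firstNonNull(Supplier<T>... candidates) {
        for (Supplier<T> candidate : candidates) {
            // Each getter is only called if all earlier ones returned null.
            T value = candidate.get();
            if (value != null) {
                return value;
            }
        }
        return null; // none of the variants is present
    }

    // Assumed usage with PDFBox, where spec is a PDComplexFileSpecification:
    //
    //   PDEmbeddedFile file = EmbeddedFileLookup.firstNonNull(
    //           spec::getEmbeddedFile,
    //           spec::getEmbeddedFileDos,
    //           spec::getEmbeddedFileUnix,
    //           spec::getEmbeddedFileMac);
}
```

The order of the getters is a choice for the caller; here the generic variant
comes first, followed by the platform-specific ones in the order listed above.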

BTW: According to the spec, the Dos, Unix and Mac variants shouldn't be used
anymore, so we should rearrange the order in "getFilename".
BTW2: Analogous to "get/setFileXXX", we should add the missing
BTW3: We should rename getUnicodeFile to getFileUnicode and add a setter for
that value as well.

I'll take care of that; see PDFBOX-2239.

>    Thank you.
>          Best,
>                     Tim

Andreas Lehmkühler
