pdfbox-dev mailing list archives

From "John Dimeo (Jira)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4341) [Patch] PNGConverter: PNG bytes to PDImageXObject converter
Date Mon, 07 Oct 2019 18:32:01 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946115#comment-16946115 ]

John Dimeo commented on PDFBOX-4341:

Thank you [~rototor] - I've used your fork (thanks to Jitpack.io!) in our project and it was
a lifesaver. We're using Pngj to write indexed-color PNGs with 8 colors and Floyd-Steinberg
dithering for the best combination of visual fidelity and very small file size, and your code
allows us to insert those PNGs into the PDF directly.

> [Patch] PNGConverter: PNG bytes to PDImageXObject converter
> -----------------------------------------------------------
>                 Key: PDFBOX-4341
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4341
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.12
>            Reporter: Emmeran Seehuber
>            Priority: Minor
>         Attachments: 001229.png, 001230.png, 008528.png, 014431.png, 016289.png, 017012.png,
017030.png, 017063.png, 017084.png, image-2018-10-25-09-29-47-251.png, optimized.zip, pngconvert_testimg.zip,
pngconvert_v1.patch, pngconvert_v2.patch
> The attached patch implements a PNG bytes to PDImageXObject converter. It tries to create
a PDImageXObject from the chunks of a PNG image, without recompressing it. This makes it possible to
use programs like pngcrush and friends to embed optimally compressed images. It’s also
way faster than recompressing the image.
> The class PNGConverter does this in three steps:
>  - Parsing the PNG chunk structure from the byte array
>  - Validating all relevant data chunks (i.e. checking the CRC). Chunks which are not
needed (e.g. text chunks) are not validated.
>  - Constructing a PDImageXObject from the chunks
> If an error occurs at any of these steps, or the converter detects that it is not possible
to map the image, it will bail out and return null. In this case the image has to be embedded
the „normal“ way by reading it using ImageIO and compressing it again.
> Only these PNG image types can be converted (at least theoretically) without recompressing
the image data:
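
The first two steps and the "return null on failure" behaviour can be sketched in plain Java as follows. This is only an illustration and not the code from the attached patch: the class PngChunkSketch, its methods, and the choice to skip only the three text chunk types are assumptions made for the example. The third step (building the PDImageXObject) is left out because it depends on PDFBox classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch; not the PNGConverter class from the patch.
final class PngChunkSketch {
    static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    /** One chunk: its type, the offset of the type field, the data length, the stored CRC. */
    static final class Chunk {
        final String type;
        final int typeStart;
        final int length;
        final int storedCrc;

        Chunk(String type, int typeStart, int length, int storedCrc) {
            this.type = type;
            this.typeStart = typeStart;
            this.length = length;
            this.storedCrc = storedCrc;
        }
    }

    /** Step 1: split the bytes into chunks; returns null if the structure is malformed. */
    static List<Chunk> parse(byte[] png) {
        if (png.length < SIGNATURE.length) return null;
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (png[i] != SIGNATURE[i]) return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(png); // big-endian by default, like PNG
        buf.position(SIGNATURE.length);
        List<Chunk> chunks = new ArrayList<>();
        while (buf.remaining() >= 12) { // length + type + CRC of an empty chunk
            int length = buf.getInt();
            if (length < 0 || length > buf.remaining() - 8) return null;
            int typeStart = buf.position();
            String type = new String(png, typeStart, 4, StandardCharsets.US_ASCII);
            buf.position(typeStart + 4 + length);
            chunks.add(new Chunk(type, typeStart, length, buf.getInt()));
            if (type.equals("IEND")) return chunks;
        }
        return null; // ran out of bytes before IEND
    }

    /** Step 2: the CRC covers the type field plus the data, but not the length field. */
    static boolean crcMatches(byte[] png, Chunk c) {
        CRC32 crc = new CRC32();
        crc.update(png, c.typeStart, 4 + c.length);
        return (int) crc.getValue() == c.storedCrc;
    }

    /** Assumption: only text chunks are skipped; the real patch may skip more. */
    static boolean allRelevantChunksValid(byte[] png, List<Chunk> chunks) {
        for (Chunk c : chunks) {
            boolean isText = c.type.equals("tEXt") || c.type.equals("zTXt") || c.type.equals("iTXt");
            if (!isText && !crcMatches(png, c)) return false;
        }
        return true;
    }

    /** Helper for building sample input: encodes one chunk with a correct CRC. */
    static byte[] encodeChunk(String type, byte[] data) {
        byte[] typeBytes = type.getBytes(StandardCharsets.US_ASCII);
        CRC32 crc = new CRC32();
        crc.update(typeBytes);
        crc.update(data);
        ByteBuffer out = ByteBuffer.allocate(12 + data.length);
        out.putInt(data.length).put(typeBytes).put(data).putInt((int) crc.getValue());
        return out.array();
    }
}
```

A caller would treat a null result from parse, or a false result from allRelevantChunksValid, as the signal to fall back to the ImageIO path described above.
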
>  - Grayscale
>  - Truecolor (i.e. RGB 8-Bit/16-Bit)
>  - Indexed
> As soon as transparency is used it gets difficult:
>  - Grayscale with alpha / truecolor with alpha: The alpha channel is saved in the image
data stream, as the pixels are stored as (Gray,Alpha) or (Red,Green,Blue,Alpha) tuples. You have
to separate the alpha information for the SMASK image. At the moment you can only read the image and
recompress it using the LosslessFactory.
>  - Indexed with alpha: Alpha and color tables are separate in the PNG, so it should
be possible to build a grayscale SMASK from the image data (which are just the table indices)
and the alpha table. Tried that, but Acrobat Reader does not like indexed SMASKs… One could
just build a grayscale SMASK using the alpha table and the decompressed image index data.
This would at least save some space, as the optimized indexed image data is still used.
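
Both transparency cases come down to simple per-pixel loops once the image data has been decompressed and unfiltered. The following Java sketch illustrates them under stated assumptions: 8 bits per sample, one index byte per pixel for the indexed case, and hypothetical names (AlphaSplitSketch, splitRgba, indexedAlphaToGray) that do not come from the patch. Palette entries without a tRNS value are treated as fully opaque, as the PNG specification requires.

```java
// Hypothetical sketch; operates on already decompressed, unfiltered 8-bit samples.
final class AlphaSplitSketch {

    /**
     * Splits interleaved (Red,Green,Blue,Alpha) tuples into two arrays:
     * result[0] holds the RGB samples, result[1] holds the alpha samples for an SMASK.
     */
    static byte[][] splitRgba(byte[] rgba) {
        if (rgba.length % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        int pixels = rgba.length / 4;
        byte[] rgb = new byte[pixels * 3];
        byte[] alpha = new byte[pixels];
        for (int p = 0; p < pixels; p++) {
            rgb[p * 3] = rgba[p * 4];
            rgb[p * 3 + 1] = rgba[p * 4 + 1];
            rgb[p * 3 + 2] = rgba[p * 4 + 2];
            alpha[p] = rgba[p * 4 + 3];
        }
        return new byte[][] {rgb, alpha};
    }

    /**
     * Builds grayscale SMASK samples for an indexed image by looking up each
     * palette index in the alpha table (the tRNS chunk data).
     * Indices beyond the end of the table are fully opaque.
     */
    static byte[] indexedAlphaToGray(byte[] indices, byte[] alphaTable) {
        byte[] gray = new byte[indices.length];
        for (int i = 0; i < indices.length; i++) {
            int index = indices[i] & 0xFF; // bytes are signed in Java
            gray[i] = index < alphaTable.length ? alphaTable[index] : (byte) 0xFF;
        }
        return gray;
    }
}
```

In the indexed case only the mask has to be compressed again; the original indexed image data can stay as it is, which is the space saving mentioned above.
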
> With the current patch, only truecolor images without alpha work correctly. The other
tests for grayscale and indexed fail. (You must place the zipped images in the resources folder
where png.png resides to run the test drivers. These images are „original“ work done by me
using Gimp, Krita and ImageOptim (on macOS) to build the different PNG image types.)
> Notes for the current patch:
>  - The grayscale images have the wrong gamma curve. I tried using the ColorSpace.CS_GRAY
ICC profile and the image now seems only „slightly“ off (i.e. pixel value FFD6D6D6 vs
FFD7D7D7). As soon as a gAMA chunk is given the image is tagged with a CalGray profile, but
then the colors are even further off.
>  - The cHRM (chroma) chunk is read and *should* work, as I used the formulas from
the PDF spec to convert the cHRM values to the CalRGB whitepoint and matrix. I have not yet
tested this, as I have no test image with cHRM at the moment. Note: Matrix(COSArray) and Matrix.toCOSArray()
are fine for geometric matrices. But these methods are wrong for any other kind of matrix (i.e.
color transform matrices), as they only store/restore 6 values of the 3x3 matrix. I deprecated
PDCalRGB.setMatrix(Matrix) because of this, as it never worked and cannot work as
long as the Matrix class is for geometric use cases only. It should also be documented on
the Matrix class that it is not general purpose. I added a PDCalRGB.setMatrix(COSArray) method
so that the matrix can still be set.
>  - The indexed image displays fine in Acrobat Reader, but the test driver fails as PDImageXObject.getImage()
returns a completely black (everything 0) image. Strange, I suspect some error in the PDFBox
image decoding.
>  - If an image is tagged with sRGB, the built-in Java sRGB ICC profile is attached. Theoretically
you can use a CalRGB colorspace, but using an ICC color profile is likely faster (at least
in PDFBox) and more „standard“.
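
To illustrate the cHRM note above, here is a Java sketch of one way to turn the eight cHRM chromaticity values into a white point and a full nine-value matrix. It uses the standard colorimetric derivation (scale the primaries so that they sum to the white point) and is not necessarily the same formula as in the PDF spec or in the patch; the class and method names are hypothetical. The nine matrix values are laid out as red XYZ, green XYZ, blue XYZ, which is the order the CalRGB Matrix entry uses as far as I know, and which shows why six values are not enough.

```java
// Hypothetical sketch; names are not from the patch or from PDFBox.
final class ChrmToCalRgbSketch {

    /** CIE xy chromaticity to XYZ with Y normalised to 1. */
    static double[] xyToXyz(double x, double y) {
        return new double[] {x / y, 1.0, (1.0 - x - y) / y};
    }

    /** Determinant of a 3x3 matrix. */
    static double det3(double[][] m) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    }

    /**
     * chrm holds the eight cHRM integers in file order: white x, white y,
     * red x, red y, green x, green y, blue x, blue y (each is the value times 100000).
     * Returns 12 doubles: white point XYZ, then red XYZ, green XYZ, blue XYZ.
     * Degenerate primaries (zero determinant) are not handled in this sketch.
     */
    static double[] toWhitePointAndMatrix(int[] chrm) {
        double[] w = xyToXyz(chrm[0] / 1e5, chrm[1] / 1e5);
        double[][] primaries = {
            xyToXyz(chrm[2] / 1e5, chrm[3] / 1e5),
            xyToXyz(chrm[4] / 1e5, chrm[5] / 1e5),
            xyToXyz(chrm[6] / 1e5, chrm[7] / 1e5)
        };
        // Columns of m are the unscaled primaries; solve m * s = w with Cramer's rule.
        double[][] m = new double[3][3];
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 3; col++) {
                m[row][col] = primaries[col][row];
            }
        }
        double d = det3(m);
        double[] out = new double[12];
        System.arraycopy(w, 0, out, 0, 3);
        for (int col = 0; col < 3; col++) {
            double[][] t = new double[3][3];
            for (int row = 0; row < 3; row++) {
                t[row] = m[row].clone();
                t[row][col] = w[row]; // replace column col by the white point
            }
            double scale = det3(t) / d;
            for (int k = 0; k < 3; k++) {
                out[3 + col * 3 + k] = scale * primaries[col][k];
            }
        }
        return out;
    }
}
```

Feeding in the sRGB chromaticities (white 0.3127/0.3290, red 0.64/0.33, green 0.30/0.60, blue 0.15/0.06) should reproduce the familiar sRGB to XYZ matrix, which makes a convenient sanity check while no test image with a cHRM chunk is available.
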
> You can also look at this patch on GitHub [https://github.com/apache/pdfbox/compare/2.0...rototor:2.0-png-from-bytes-encoder?expand=1]
if you like.
> It would be nice if someone could give me some hints with the colorspace problems. I
will try to reread the specs, maybe I have missed something. But it would be great if
someone else who has an idea about colorspaces could also take a look at this.
> As I have no idea how long it takes to understand why the colors are off for grayscale
and wrong for indexed, I could prepare a stripped down version of this patch, which only contains
the working stuff (i.e. truecolor), and would just do nothing in the cases that do not work yet. What
do you think?
