pdfbox-dev mailing list archives

From "Emmeran Seehuber (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4341) [Patch] PNGConverter: PNG bytes to PDImageXObject converter
Date Sun, 21 Oct 2018 19:33:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658346#comment-16658346 ]

Emmeran Seehuber commented on PDFBOX-4341:

No problem, I know there is also always that strange thing called "real life" :)
 * PDCalRGB: setMatrix(): fixed it using getValues()
 * Intent: You are right, I missed that completely; the values in the PDF are of course different
from the integers in the PNG... To be honest, the Intent is only relevant for PDF readers/printers
that try to output the image on devices whose gamut does not cover the full sRGB gamut (nowadays
mostly laser printers; ink printers can usually handle sRGB without a problem). And as soon
as an ICC profile is specified, at least the Java color management uses the intent defined
in the ICC profile. You have no other choice than to patch the ICC profile bytes if you want
a different intent from the one in the ICC profile in Java... But setting the rendering intent
on the image should not hurt in any way.
 * The "black" indexed image was a "stale PDImageXObject state" bug: the object has to reread
the image/colorspace as soon as you call setColorSpace() on it. I.e. the bug was that PDImageXObject.setColorSpace()
has to clear the colorSpace and cachedImage fields. I've added this to PDImageXObject.setColorSpace().
You can't just assign the colorspace parameter to the colorSpace field, as the PDColorSpace
used for writing does not have all the state initialized that is needed to read an image...
 * I stripped down the patch and fixed some bugs on the way ... the usual "how could that
work in the first place?!" stuff :)

 ** TrueColor/sRGB without transparency works
 ** When a gamma value of 1/2.2 is specified it is just ignored, because that's the gamma of
 ** Indexed images work with and without transparency. I've updated the test images and have
test files for 1, 2, 4 and 8 bit indexed images with transparency. This means that if a user optimizes
a PNG image and it has no more than 256 color/transparency combinations, it will usually
get indexed while optimizing, and that will be kept at least for the image data. The SMASK
has to be rebuilt as a grayscale image, as Acrobat Reader does not like indexed SMASKs. But this
will usually still be smaller and faster than recompressing the image.
 ** So "only" grayscale and grayscale/TrueColor with transparency are not handled. For grayscale
I have no idea how to map the colorspace, and when transparency is combined into the image data
we simply cannot map that in the PDF without completely recoding it. Also TrueColor with
a chroma chunk but without an ICC profile is not handled, as I have no idea how to map that
color space.
 ** I've updated the image zip as I had to add some additional test images for the indexed
image case.
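
For the indexed-with-transparency case above, rebuilding the SMASK as an 8-bit grayscale image
could look roughly like the sketch below. It is only an illustration, not the code from the patch:
the class and method names (IndexedSmaskSketch, buildAlphaMask) are made up, and it assumes the
index data has already been inflated and unfiltered, and that the image is not interlaced.
Palette entries beyond the end of the tRNS table are treated as fully opaque, as in PNG.

```java
/** Illustrative sketch only; not part of PDFBox or of the attached patch. */
final class IndexedSmaskSketch {

    /**
     * Builds one 8-bit alpha sample per pixel from packed palette indices.
     *
     * @param packedIndices decoded (inflated and unfiltered) scanlines, each row byte-aligned
     * @param bitDepth      1, 2, 4 or 8 bits per palette index
     * @param trnsAlpha     alpha table from the tRNS chunk, may be shorter than the palette
     */
    static byte[] buildAlphaMask(byte[] packedIndices, int width, int height,
                                 int bitDepth, byte[] trnsAlpha) {
        if (bitDepth != 1 && bitDepth != 2 && bitDepth != 4 && bitDepth != 8) {
            throw new IllegalArgumentException("unsupported bit depth: " + bitDepth);
        }
        int rowStride = (width * bitDepth + 7) / 8; // every PNG scanline starts on a byte boundary
        if (packedIndices.length < rowStride * height) {
            throw new IllegalArgumentException("index data too short");
        }
        int indexMask = (1 << bitDepth) - 1;
        byte[] alpha = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int bitOffset = x * bitDepth;
                int packed = packedIndices[y * rowStride + bitOffset / 8] & 0xFF;
                // The leftmost pixel sits in the high-order bits of the byte.
                int shift = 8 - bitDepth - (bitOffset % 8);
                int index = (packed >> shift) & indexMask;
                // Indices without a tRNS entry are opaque (alpha 255).
                alpha[y * width + x] = index < trnsAlpha.length ? trnsAlpha[index] : (byte) 0xFF;
            }
        }
        return alpha;
    }

    public static void main(String[] args) {
        // 3x2 image, 2 bits per index: row 0 = indices 0,1,2; row 1 = indices 3,0,1.
        byte[] packed = {0x18, (byte) 0xC4};
        byte[] trns = {0, (byte) 128}; // index 0 transparent, index 1 half transparent
        for (byte a : buildAlphaMask(packed, 3, 2, 2, trns)) {
            System.out.println(a & 0xFF);
        }
    }
}
```

The resulting bytes would then be compressed and attached as a DeviceGray soft mask, while the
original indexed image data stays untouched.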

By digitalcorpora, do you mean the govdoc files? I can run a test over these, but I can't yet
say when I will have time for it.
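
Regarding the Intent point above: the PNG sRGB chunk stores the rendering intent as a single
integer (0 to 3), while PDF uses names for /Intent. A minimal sketch of that mapping is below.
PngIntentMapper and toPdfIntentName are hypothetical names used only for illustration; they are
not PDFBox API and not necessarily how the patch does it.

```java
import java.util.Optional;

/** Illustrative helper only; not part of PDFBox. */
final class PngIntentMapper {

    // Array position = intent value in the PNG sRGB chunk, entry = PDF rendering intent name.
    private static final String[] PDF_INTENT_NAMES = {
        "Perceptual",           // PNG 0: perceptual
        "RelativeColorimetric", // PNG 1: relative colorimetric
        "Saturation",           // PNG 2: saturation
        "AbsoluteColorimetric"  // PNG 3: absolute colorimetric
    };

    /** Returns the PDF intent name, or empty if the PNG value is outside 0..3. */
    static Optional<String> toPdfIntentName(int pngIntent) {
        if (pngIntent < 0 || pngIntent >= PDF_INTENT_NAMES.length) {
            return Optional.empty();
        }
        return Optional.of(PDF_INTENT_NAMES[pngIntent]);
    }

    public static void main(String[] args) {
        for (int pngIntent = 0; pngIntent <= 4; pngIntent++) {
            System.out.println(pngIntent + " -> "
                    + toPdfIntentName(pngIntent).orElse("(unknown, leave /Intent unset)"));
        }
    }
}
```

Returning an empty Optional for unknown values lets the caller simply skip setting the intent
instead of failing the whole conversion.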

> [Patch] PNGConverter: PNG bytes to PDImageXObject converter
> -----------------------------------------------------------
>                 Key: PDFBOX-4341
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4341
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.12
>            Reporter: Emmeran Seehuber
>            Priority: Minor
>         Attachments: pngconvert_testimg.zip, pngconvert_v1.patch, pngconvert_v2.patch
> The attached patch implements a PNG bytes to PDImageXObject converter. It tries to create
a PDImageXObject from the chunks of a PNG image, without recompressing it. This allows using
programs like pngcrush and friends to embed optimally compressed images. It’s also
way faster than recompressing the image.
> The class PNGConverter does this in three steps:
>  - Parsing the PNG chunk structure from the byte array
>  - Validating all relevant data chunks (i.e. checking the CRC). Chunks which are not
needed (e.g. text chunks) are not validated.
>  - Constructing a PDImageXObject from the chunks
> When an error occurs at any of these steps or the converter detects that it is not possible
to map the image, it will bail out and return null. In this case the image has to be embedded
the „normal“ way by reading it using ImageIO and compressing it again.
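
The first two steps described above (walking the chunk structure, then checking the CRC of the
chunks that matter) can be sketched as follows. This is only an illustration and not the code
from the patch: PngChunkSketch, Chunk, parse and crcMatches are made-up names. It relies on the
PNG layout of an 8-byte signature followed by chunks of length, type, data and CRC, where the
CRC covers the type and data bytes; java.util.zip.CRC32 computes the same CRC-32.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Illustrative sketch only; not part of PDFBox or of the attached patch. */
final class PngChunkSketch {

    static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    static final class Chunk {
        final String type;   // four ASCII letters, e.g. IHDR, PLTE, IDAT, IEND
        final int dataOffset; // offset of the chunk data inside the PNG byte array
        final int length;     // length of the chunk data in bytes
        final int storedCrc;  // CRC as stored in the file

        Chunk(String type, int dataOffset, int length, int storedCrc) {
            this.type = type;
            this.dataOffset = dataOffset;
            this.length = length;
            this.storedCrc = storedCrc;
        }
    }

    /** Returns the chunks up to and including IEND, or null if the structure is malformed. */
    static List<Chunk> parse(byte[] png) {
        if (png.length < SIGNATURE.length) {
            return null;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (png[i] != SIGNATURE[i]) {
                return null;
            }
        }
        List<Chunk> chunks = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(png); // big-endian by default, as PNG requires
        int pos = SIGNATURE.length;
        while (pos + 12 <= png.length) { // 4 length + 4 type + 4 CRC bytes at minimum
            int length = buf.getInt(pos);
            if (length < 0 || length > png.length - pos - 12) {
                return null; // the chunk would run past the end of the array
            }
            String type = new String(png, pos + 4, 4, StandardCharsets.US_ASCII);
            int storedCrc = buf.getInt(pos + 8 + length);
            chunks.add(new Chunk(type, pos + 8, length, storedCrc));
            pos += 12 + length;
            if (type.equals("IEND")) {
                return chunks;
            }
        }
        return null; // ran out of bytes without seeing IEND
    }

    /** Validates one chunk; the CRC is computed over the type bytes plus the data bytes. */
    static boolean crcMatches(byte[] png, Chunk chunk) {
        CRC32 crc = new CRC32();
        crc.update(png, chunk.dataOffset - 4, chunk.length + 4);
        return (int) crc.getValue() == chunk.storedCrc;
    }

    public static void main(String[] args) {
        // Smallest input this sketch accepts: signature plus an empty IEND chunk.
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A,
                      0, 0, 0, 0, 'I', 'E', 'N', 'D',
                      (byte) 0xAE, 0x42, 0x60, (byte) 0x82};
        for (Chunk chunk : parse(png)) {
            System.out.println(chunk.type + " crc ok: " + crcMatches(png, chunk));
        }
    }
}
```

A caller would only run crcMatches on the chunks it actually uses and fall back to the ImageIO
path as soon as parse returns null or a CRC does not match.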
> Only these PNG image types can be converted (at least theoretically) without recompressing
the image data:
>  - Grayscale
>  - Truecolor (i.e. RGB 8-Bit/16-Bit)
>  - Indexed
> As soon as transparency is used it gets difficult:
>  - Grayscale with alpha / truecolor with alpha: The alpha channel is saved in the image
data stream, as they are stored as (Gray,Alpha) or (Red,Green,Blue,Alpha) tuples. You have
to separate the alpha information for the SMASK-Image. At this moment you can just read and
recompress it using the LosslessFactory.
>  - Indexed with alpha. Alpha and color tables are separate in the PNG, so it should
be possible to build an SMASK from the image data (which are just the table indices)
and the alpha table. Tried that, but Acrobat Reader does not like indexed SMASKs… One could
just build a grayscale SMASK using the alpha table and the decompressed image index data.
This would at least save some space, as the optimized indexed image data is still used.
> With the current patch only truecolor images without alpha work correctly. The other
tests for grayscale and indexed fail. (You must place the zipped images in the resources folder
where png.png resides to run the test drivers. These images are „original“ work done by me
using Gimp, Krita and ImageOptim (on macOS) to build the different PNG image types.)
> Notes for the current patch:
>  - The grayscale images have the wrong gamma curve. I tried using the ColorSpace.CS_GRAY
ICC profile and the image seems now only „slightly“ off (i.e. pixel value FFD6D6D6 vs
FFD7D7D7). As soon as a gAMA chunk is given the image is tagged with a CalGray profile, but
the colors are way more off then.
>  - The cHRM (chroma) chunk is read and *should* work, as I used the formulas from
the PDF spec to convert the cHRM values to the CalRGB whitepoint and matrix. I have not yet
tested this, as I have no test image with cHRM at the moment. Note: Matrix(COSArray) and Matrix.toCOSArray()
are fine for geometric matrices. But these methods are wrong for any other kind of matrix (i.e.
color transform matrices), as they only store/restore 6 values of the 3x3 matrix. I deprecated
PDCalRGB.setMatrix(Matrix) because of this, as this was never working and cannot work as
long as the Matrix class is for geometric use cases only. It should also be documented on
the Matrix class that it is not general purpose. I added a PDCalRGB.setMatrix(COSArray) method
to allow setting the matrix.
>  - The indexed image displays fine in Acrobat Reader, but the test driver fails as PDImageXObject.getImage()
returns a completely black (everything 0) image. Strange, I suspect some error in the PDFBox
image decoding.
>  - If an image is tagged with sRGB, the builtin Java sRGB ICC profile is attached. Theoretically
you can use a CalRGB colorspace, but using an ICC color profile is likely faster (at least
in PDFBox) and more „standard“.
> You can also look at this patch on GitHub [https://github.com/apache/pdfbox/compare/2.0...rototor:2.0-png-from-bytes-encoder?expand=1]
if you like.
> It would be nice if someone could give me some hints with the colorspace problems. I
will try to reread the specs again, maybe I have missed something. But it would be great if
someone else who has an idea about colorspaces could also take a look into this.
> As I have no idea how long it takes to understand why the colors are off for grayscale
and wrong for indexed, I could prepare a stripped-down version of this patch, which only contains
the working stuff (i.e. truecolor), and would just do nothing in the non-working cases. What
do you think?
