pdfbox-users mailing list archives

From Ilija Pavlic <ilija.pav...@gmail.com>
Subject Did I understand text color extraction correctly?
Date Wed, 08 Feb 2012 13:47:21 GMT
I created a document with just one line of green text
(RGB=[146,208,80]), and wrote this small example:

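// PDFBox 1.x package names
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFStreamEngine;
import org.apache.pdfbox.util.ResourceLoader;

PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    // Use the same operator-to-class mapping that PageDrawer uses.
    PDFStreamEngine engine = new PDFStreamEngine(
            ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    // Run every operator in the first page's content stream through the engine.
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    // The graphics state now holds whatever the last operators left behind.
    PDGraphicsState graphicState = engine.getGraphicsState();
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        // Components are in the range 0..1; scale them to 0..255.
        System.out.println(c * 255);
    }
}
finally {
    if (doc != null) {
        doc.close();
    }
}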

That outputs:
DeviceRGB
146.115
208.08
80.07
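
The fractional parts would be explained if the file stores each component rounded to three decimals (0.573, 0.816, 0.314), since 0.573 * 255 = 146.115; that is an inference from the output, not something read from the file. A minimal sketch of rounding the scaled values back to 8-bit channels, where ColorComponents and its methods are hypothetical helper names and not part of PDFBox:

```java
// Hypothetical helper, not part of PDFBox: converts color components
// in the range 0.0..1.0 to 8-bit channel values.
final class ColorComponents {
    private ColorComponents() {}

    /** Scales one component to 0..255, rounds to the nearest integer, and clamps. */
    static int toByte(float component) {
        int scaled = Math.round(component * 255f);
        return Math.max(0, Math.min(255, scaled));
    }

    /** Converts every component of a color value array. */
    static int[] toRgb(float[] components) {
        int[] rgb = new int[components.length];
        for (int i = 0; i < components.length; i++) {
            rgb[i] = toByte(components[i]);
        }
        return rgb;
    }
}
```

With the three-decimal values assumed above, ColorComponents.toRgb(new float[] {0.573f, 0.816f, 0.314f}) gives 146, 208, 80, which matches the color the document was created with.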

So it seems that I got the text color out of the document. However, I
am not sure I understand the color extraction correctly. Here is how I
see it:

As I understand it, PDFStreamEngine has multiple variables describing
its current state, like graphicsState, textMatrix, textLineMatrix,
etc. When PDFStreamEngine processes a page stream, it sets its state
variables depending on what operators it is processing at the moment.

So when it hits green text, it will change the PDGraphicsState
graphicsState because it will encounter appropriate operators. For CS
it will call org.apache.pdfbox.util.operator.SetStrokingColorSpace as
defined by mapping
CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the
.properties file. RG will be mapped to
org.apache.pdfbox.util.operator.SetStrokingRGBColor and so on.
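
The mapping idea can be sketched with a toy engine that looks up a handler by operator name and lets the handler update the engine's state. This is a simplified model for illustration only: ToyDispatchEngine, Handler and the whitespace tokenizer are assumptions, and the real PDFStreamEngine parses streams and loads its operator classes differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of operator dispatch; all names here are hypothetical, not PDFBox API.
final class ToyDispatchEngine {
    /** One handler per operator name, like one class per entry in the .properties file. */
    interface Handler {
        void process(ToyDispatchEngine engine, List<Float> operands);
    }

    float[] strokingColor = {0f, 0f, 0f};
    float[] nonStrokingColor = {0f, 0f, 0f};
    private final Map<String, Handler> handlers = new HashMap<>();

    ToyDispatchEngine() {
        // RG sets the stroking color, rg the non-stroking (fill) color.
        handlers.put("RG", (engine, ops) -> engine.strokingColor = rgb(ops, engine.strokingColor));
        handlers.put("rg", (engine, ops) -> engine.nonStrokingColor = rgb(ops, engine.nonStrokingColor));
    }

    /** Returns the three operands as an RGB array, or the old value if the count is wrong. */
    private static float[] rgb(List<Float> ops, float[] fallback) {
        if (ops.size() != 3) {
            return fallback;
        }
        return new float[] {ops.get(0), ops.get(1), ops.get(2)};
    }

    /** Collects numeric operands until an operator token arrives, then dispatches it. */
    void process(String content) {
        List<Float> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.matches("[-+]?\\d*\\.?\\d+")) {
                operands.add(Float.parseFloat(token));
            } else {
                Handler handler = handlers.get(token);
                if (handler != null) {
                    handler.process(this, operands);
                }
                operands = new ArrayList<>(); // operands belong to one operator only
            }
        }
    }
}
```

Feeding it "0.573 0.816 0.314 RG" leaves strokingColor at the green value; operators without a registered handler are skipped, which is roughly what happens to operators missing from the .properties mapping.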

As it goes on, it will change its graphicsState to something else;
the color will change to black for black text and so on. PDF
operators are like sequences of instructions for drawing: "pick black
color; go to (x1,y1); draw a rectangle to (x2,y2)".

In this particular case, the graphics state still holds the green
color after processing, because the document contains only text and
all of it is in one style. For something more advanced, I would need
to extend PDFStreamEngine (just like PageDrawer, PDFTextStripper and
other classes do) to do something whenever the color changes.
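
The "extend the engine" idea can be sketched with a toy base class that exposes a hook, plus a subclass that records the color in effect each time text is shown. Everything here is hypothetical (MiniEngine, ColorRecordingEngine, onShowText, the whitespace tokenizer); a real subclass would override the corresponding PDFStreamEngine methods instead. The toy reads the fill color set by rg and treats a parenthesized token without spaces as the operand of Tj.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a stream engine; names are hypothetical, not PDFBox API.
class MiniEngine {
    protected float[] fillColor = {0f, 0f, 0f};

    /** Hook for subclasses, called for every Tj (show text) operator. */
    protected void onShowText(String text) {
    }

    private static boolean isOperand(String token) {
        boolean isString = token.length() >= 2 && token.startsWith("(") && token.endsWith(")");
        return isString || token.matches("[-+]?\\d*\\.?\\d+");
    }

    final void process(String content) {
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (isOperand(token)) {
                operands.add(token);
                continue;
            }
            if (token.equals("rg") && operands.size() == 3) {
                fillColor = new float[] {
                    Float.parseFloat(operands.get(0)),
                    Float.parseFloat(operands.get(1)),
                    Float.parseFloat(operands.get(2))};
            } else if (token.equals("Tj") && operands.size() == 1
                    && operands.get(0).startsWith("(")) {
                String s = operands.get(0);
                onShowText(s.substring(1, s.length() - 1)); // strip the parentheses
            }
            operands.clear(); // any operator consumes the pending operands
        }
    }
}

// Subclass that remembers which color was active for each piece of text.
class ColorRecordingEngine extends MiniEngine {
    static final class Run {
        final String text;
        final float[] rgb;
        Run(String text, float[] rgb) {
            this.text = text;
            this.rgb = rgb;
        }
    }

    final List<Run> runs = new ArrayList<>();

    @Override
    protected void onShowText(String text) {
        runs.add(new Run(text, fillColor.clone()));
    }
}
```

Processing "0.573 0.816 0.314 rg (Hello) Tj 0 0 0 rg (World) Tj" yields two runs, the first green and the second black, instead of only the state left over at the end of the stream.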

Is that approximately correct?

Thank you,
Ilija.
