pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Suppressing layers on output
Date Tue, 21 Jun 2016 15:59:04 GMT

> On 19 Jun 2016, at 08:04, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
> 
> Am 19.06.2016 um 16:11 schrieb Tilman Hausherr:
>> Am 19.06.2016 um 08:52 schrieb John Hewson:
>>>>> JIRA, and attach your code as a patch / diff.
>>>> There is already some code handling those operators, see
>>>> PDFMarkedContentExtractor. It could be moved to a more generic place so that
>>>> we have to add some filtering only.
>>> Yes, that is the proper way to handle this. Operators are handled with an
>>> OperatorProcessor, not by modifying the parser (e.g. processStreamOperators).
>>> Better yet, we already have the code to handle BMC/EMC. All that is needed is
>>> for PDFRenderer to add a constructor which accepts a list of layer names to
>>> render, which are then passed as part of PageDrawerParameters.
>> 
>> The problem is that these two operators influence whether or not all the other
>> tokens in the content stream are used or not. So the method by C. makes sense to
>> me.  The alternative would be to alter every operator processor to check whether
>> it is relevant or not.
>> Or they would have to be extended from some common class that does this check.

The alternative is actually really simple. The parser should not be responsible for high-level
processing such as this. It’s the job of an OperatorProcessor to handle how operators are
processed, and of PDFStreamEngine to handle the actual work - that’s the core of our
subclassing & extensibility model for PDFBox.

So take the view that BMC and EMC don’t affect the tokens, they affect rendering. We should
still process the tokens as normal and have BMC and EMC set a flag on PageDrawer (or one of
its superclasses) which indicates which layer is currently being processed. The PageDrawer
can then decide what to do with this information - namely check in strokePath, fillPath,
fillAndStrokePath, and drawImage whether or not to suppress rendering. No need to extend any
OperatorProcessors.
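To make the idea concrete, here is a minimal sketch that deliberately does not use any PDFBox classes. LayerAwareDrawer, beginMarkedContent, endMarkedContent and the painted list are hypothetical names for illustration only; a real version would live in PageDrawer and be driven by the marked content operators. It only shows the shape of the approach: every token is still processed, and the paint methods consult the layer state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy stand-in for a PageDrawer-like class; hypothetical, not PDFBox API. */
class LayerAwareDrawer
{
    private final Set<String> hiddenLayers;
    // stack of all currently open layers, innermost first
    private final Deque<String> layerStack = new ArrayDeque<>();
    // how many of the open layers are hidden (handles nesting)
    private int hiddenDepth = 0;
    // records what would have been painted
    final List<String> painted = new ArrayList<>();

    LayerAwareDrawer(Set<String> hiddenLayers)
    {
        this.hiddenLayers = new HashSet<>(hiddenLayers);
    }

    /** Called when a marked content section for a layer starts. */
    void beginMarkedContent(String layerName)
    {
        layerStack.push(layerName);
        if (hiddenLayers.contains(layerName))
        {
            hiddenDepth++;
        }
    }

    /** Called when the innermost marked content section ends. */
    void endMarkedContent()
    {
        if (layerStack.isEmpty())
        {
            return; // tolerate an unbalanced end marker
        }
        String closed = layerStack.pop();
        if (hiddenLayers.contains(closed))
        {
            hiddenDepth--;
        }
    }

    boolean isSuppressed()
    {
        return hiddenDepth > 0;
    }

    /** The paint method itself decides whether to draw. */
    void fillPath(String description)
    {
        if (isSuppressed())
        {
            return;
        }
        painted.add(description);
    }
}
```

The same check would go into strokePath, fillAndStrokePath and drawImage. A counter is used next to the stack so that content nested inside a hidden layer stays hidden until that layer is closed.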

I’ve explained how this would be done for PageDrawer, but it might be better to do all of
this in PDFStreamEngine rather than PageDrawer, as then other subclasses can benefit from
this functionality.

>> PDFMarkedContentExtractor is not really helpful. Here's some code to show what
>> it does - it shows the objects that belong to a specific group. The output
>> cannot be used for rendering.
> Maybe there is a misunderstanding. We need to track the current layer and the stack of
> all current layers. C. provided some code doing that and we already have some code doing it
> (I'm talking about the operators in org.apache.pdfbox.contentstream.operator.markedcontent).
> What is missing is some sort of filter based on that information.

Exactly, PDFMarkedContentExtractor already contains implementations of the necessary
OperatorProcessors. We just need to move them into separate files, and as you say, add some
sort of filter in PDFStreamEngine / PageDrawer.
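Roughly, the structure could look like the following toy model. None of these types are PDFBox classes: ToyOperatorHandler, ToyStreamEngine, ToyCollector and the handler classes are made-up names, and layers are identified by plain strings here rather than by resolving a property name from the page resources. The point is only that the handlers are standalone classes, the filter lives in the base engine, and any subclass gets it for free.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical handler interface, loosely modelled on the idea of an OperatorProcessor. */
interface ToyOperatorHandler
{
    void process(ToyStreamEngine engine, List<String> operands);
}

/** Begin marked content: opens the layer named by the first operand. */
class BeginLayerHandler implements ToyOperatorHandler
{
    public void process(ToyStreamEngine engine, List<String> operands)
    {
        engine.beginLayer(operands.isEmpty() ? "" : operands.get(0));
    }
}

/** End marked content: closes the innermost layer. */
class EndLayerHandler implements ToyOperatorHandler
{
    public void process(ToyStreamEngine engine, List<String> operands)
    {
        engine.endLayer();
    }
}

/** Show text: always processed, the engine decides whether to emit it. */
class ShowTextHandler implements ToyOperatorHandler
{
    public void process(ToyStreamEngine engine, List<String> operands)
    {
        engine.showText(operands.isEmpty() ? "" : operands.get(0));
    }
}

/** Toy base engine: dispatches operators and owns the layer filter. */
abstract class ToyStreamEngine
{
    private final Map<String, ToyOperatorHandler> handlers = new HashMap<>();
    private final Deque<String> layers = new ArrayDeque<>();
    private final Set<String> hidden;

    ToyStreamEngine(Set<String> hiddenLayers)
    {
        hidden = new HashSet<>(hiddenLayers);
        handlers.put("BDC", new BeginLayerHandler());
        handlers.put("EMC", new EndLayerHandler());
        handlers.put("Tj", new ShowTextHandler());
    }

    void beginLayer(String name)
    {
        layers.push(name);
    }

    void endLayer()
    {
        if (!layers.isEmpty())
        {
            layers.pop();
        }
    }

    /** True if any currently open layer is hidden. */
    boolean isSuppressed()
    {
        for (String layer : layers)
        {
            if (hidden.contains(layer))
            {
                return true;
            }
        }
        return false;
    }

    /** The filter: subclasses only see text outside hidden layers. */
    void showText(String text)
    {
        if (!isSuppressed())
        {
            emitText(text);
        }
    }

    /** Every operator is dispatched; unknown ones are ignored in this toy. */
    void processOperator(String operator, List<String> operands)
    {
        ToyOperatorHandler handler = handlers.get(operator);
        if (handler != null)
        {
            handler.process(this, operands);
        }
    }

    protected abstract void emitText(String text);
}

/** Example subclass that benefits from the filter without knowing about layers. */
class ToyCollector extends ToyStreamEngine
{
    final List<String> shown = new ArrayList<>();

    ToyCollector(Set<String> hiddenLayers)
    {
        super(hiddenLayers);
    }

    protected void emitText(String text)
    {
        shown.add(text);
    }
}
```

Feeding this the operator sequence for an "enabled" layer followed by a "disabled" one, with "disabled" in the hidden set, leaves only the text of the enabled layer in shown.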

> BR
> Andreas
>> 
>> 
>> import java.io.File;
>> import java.io.IOException;
>> import java.util.Arrays;
>> import java.util.List;
>> import org.apache.pdfbox.cos.COSName;
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.pdmodel.PDPage;
>> import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
>> import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
>> import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
>> import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentProperties;
>> import org.apache.pdfbox.text.PDFMarkedContentExtractor;
>> 
>> public class ExtractMarkedContent extends PDFMarkedContentExtractor
>> {
>> 
>>    public ExtractMarkedContent() throws IOException
>>    {
>>    }
>> 
>>    public static void main(String[] args) throws IOException
>>    {
>> 
>>        PDDocument doc = PDDocument.load(new File("C......\\PDFBox reactor\\pdfbox\\target\\test-output","ocg-generation.pdf"));
>>        PDOptionalContentProperties ocp = doc.getDocumentCatalog().getOCProperties();
>>        System.out.println("Group names in document catalog: " + Arrays.toString(ocp.getGroupNames()));
>>        for (String groupName : ocp.getGroupNames())
>>        {
>>            PDOptionalContentGroup group = ocp.getGroup(groupName);
>>            System.out.println(group.getCOSObject());
>>        }
>>        ExtractMarkedContent extractMarkedContent = new ExtractMarkedContent();
>>        PDPage page = doc.getPage(0);
>>        System.out.println("Property names in page resources: " + page.getResources().getPropertiesNames());
>>        extractMarkedContent.processPage(page);
>>        List<PDMarkedContent> markedContents = extractMarkedContent.getMarkedContents();
>>        System.out.println("Extracted contents: ");
>>        for (PDMarkedContent mc : markedContents)
>>        {
>>            PDPropertyList propertyList = page.getResources().getProperties(COSName.getPDFName(mc.getTag()));
>>            String propName = propertyList.getCOSObject().getString(COSName.NAME);
>>            System.out.println(mc.getTag() + " (" + propName + "): " + mc.getContents());
>>        }
>>        doc.close();
>>    }
>> }
>> 
>> 
>> The output is:
>> 
>> Group names in document catalog: [background, enabled, disabled]
>> COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{background}) }
>> COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{enabled}) }
>> COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{disabled}) }
>> Property names in page resources: [COSName{oc1}, COSName{oc2}, COSName{oc3}]
>> Extracted contents:
>> oc1 (background): [P, D, F,  , 1, ., 5, :,  , O, p, t, i, o, n, a, l,  , C, o,
>> n, t, e, n, t,  , G, r, o, u, p, s, Y, o, u,  , s, h, o, u, l, d,  , s, e, e,  ,
>> a,  , g, r, e, e, n,  , t, e, x, t, l, i, n, e, ,,  , b, u, t,  , n, o,  , r, e,
>> d,  , t, e, x, t,  , l, i, n, e, .]
>> oc2 (enabled): [T, h, i, s,  , i, s,  , f, r, o, m,  , a, n,  , e, n, a, b, l,
>> e, d,  , l, a, y, e, r, .,  , I, f,  , y, o, u,  , s, e, e,  , t, h, i, s, ,,  ,
>> t, h, a, t, ', s,  , g, o, o, d, .]
>> oc3 (disabled): [T, h, i, s,  , i, s,  , f, r, o, m,  , a,  , d, i, s, a, b, l,
>> e, d,  , l, a, y, e, r, .,  , I, f,  , y, o, u,  , s, e, e,  , t, h, i, s, ,,  ,
>> t, h, a, t, ', s,  , N, O, T,  , g, o, o, d, !]
>> 
>> 
>> 
>> 