pdfbox-users mailing list archives

From Roberto Nibali <rnib...@gmail.com>
Subject Re: How to extract the object id from a form field?
Date Wed, 19 Aug 2015 19:16:52 GMT
Hi Tilman

Thanks for your reply ... I did not really succeed. We'll probably end up
looking at how the PDFDebugger code does it ;).

On Tue, Aug 18, 2015 at 9:08 PM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> Am 18.08.2015 um 20:50 schrieb Roberto Nibali:
>
> Hi
>
> I'd like to print out the corresponding object id given a specific form
> field. How would I do that with PDFBox programmatically?
>
> Let's assume, for the sake of argument, that the form field is
> represented by the following obj:
>
> obj 218 0
>   <<
>     /DA <2B94B0298F2FD7F81F32C6E22043>
>     /F 4
>     /FT /Tx
>     /Ff 4194304
>     /MK
>     /P 28 0 R
>     /Parent 46 0 R
>     /Rect [159.781 764.53 347.142 777.195]
>     /Subtype /Widget
>     /T <5EB6B730886188AB3D3194B9654C18094C>
>     /Type /Annot
>     /V <45BBBA249C618BBD3974A4BE61501E57181D>
>     /AP 666 0 R
>   >>
>
> If I am going over all PDField entries of a PDF, how would I get to the
> underlying obj number (in the above case 218) from a PDField object?
>
>
> I haven't tried this myself, but I think you could "synchronise" the
> getChildren() results with the getCOSObject().getItem(COSName.KIDS) array,
> i.e. sort out which indirect type is which item returned from
> getChildren(). The Kids COSArray has indirect objects (= COSObject type),
> as seen here:
>
>
>
> COSObject.getObject() returns the dereferenced object.
>
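
For reference, the synchronisation Tilman describes could look roughly like
the sketch below. It assumes the PDFBox 2.0 API (COSArray.get(int) returning
the raw, non-dereferenced entry, and COSObject.getObjectNumber()); the class
name KidsObjectIds and the method dumpKids are made up for illustration. Note
that a terminal field such as obj 218 above has no /Kids entry of its own, so
the lookup has to start from its parent.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDNonTerminalField;

// Hypothetical helper, assuming PDFBox 2.0.x on the classpath.
public class KidsObjectIds {

    // Prints "fully qualified name -> object number" for each child of parent.
    static void dumpKids(PDNonTerminalField parent) {
        // getDictionaryObject() dereferences, in case /Kids is itself indirect
        COSBase kidsBase = parent.getCOSObject().getDictionaryObject(COSName.KIDS);
        if (!(kidsBase instanceof COSArray)) {
            return;
        }
        COSArray kids = (COSArray) kidsBase;
        for (PDField child : parent.getChildren()) {
            for (int i = 0; i < kids.size(); i++) {
                // get(i) keeps the indirect reference; getObject(i) would not
                COSBase raw = kids.get(i);
                if (raw instanceof COSObject
                        && ((COSObject) raw).getObject() == child.getCOSObject()) {
                    COSObject ref = (COSObject) raw;
                    System.out.printf("%s -> obj %d %d%n",
                            child.getFullyQualifiedName(),
                            ref.getObjectNumber(), ref.getGenerationNumber());
                }
            }
            if (child instanceof PDNonTerminalField) {
                dumpKids((PDNonTerminalField) child);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
            if (acroForm == null) {
                return; // no form in this document
            }
            for (PDField field : acroForm.getFields()) {
                if (field instanceof PDNonTerminalField) {
                    dumpKids((PDNonTerminalField) field);
                }
            }
        }
    }
}
```

This only covers fields that hang below a non-terminal field. The ids of the
top-level fields would have to be matched the same way against the /Fields
array of the AcroForm dictionary.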

The reason I asked is that, while migrating some documents, we found that
the original PDFs differ from the template not only in their text (mostly
legal changes in the fixed text); in certain cases the client also modified
the PDFs by adding borders or other graphical elements. Those obviously do
not show up in the template PDF.

My (maybe stupid) idea was to simply print out the obj id, or even the whole
object, and then insert it into the template for the final PDF during the
form field migration, updating all references to the new obj id as well.

At least for simple geometric shapes, like rectangles, this should be
feasible, no? Anyway, after constantly getting "null" from the
getCOSObject().getItem(COSName.KIDS) and nothing out of getChildren() from
a given PDField, I kind of gave up.

Imagine you had the following code, and wanted to additionally dump out the
underlying object id and the referencing ids of the PDField:

@Test
public void executeDumpFields() throws IOException {
    // try-with-resources closes the document exactly once, even on failure
    try (PDDocument srcDoc = PDDocument.load(new File(srcDocName))) {
        PDAcroForm acroForm = srcDoc.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return; // the document has no form
        }
        List<PDField> fields = acroForm.getFields();
        for (PDField field : fields) {
            dumpField(srcDoc, field);
        }
    } catch (Exception e) {
        logerr(e.getMessage());
    }
}

private void dumpField(PDDocument srcDoc, PDField srcField) throws IOException {
    if (srcField instanceof PDNonTerminalField) {
        for (PDField child : ((PDNonTerminalField) srcField).getChildren()) {
            dumpField(srcDoc, child);
        }
    } else if (!(srcField instanceof PDSignatureField)) {
        String fqName = srcField.getFullyQualifiedName();
        // getSimpleName() yields the class name without its package
        String type = srcField.getClass().getSimpleName();
        System.out.printf("fqName=%s type=%s%n", fqName, type);
    }
}

I have made it a habit to dump objects with pdf-parser
(http://blog.didierstevens.com/programs/pdf-tools/) as follows to further
investigate issues (excerpt showing the dump of object 228):

$ python pdf-parser.py -o 228 ../../ccmig2.pdf

obj 228 0
 Type: /Annot
 Referencing: 685 0 R, 28 0 R, 46 0 R, 686 0 R

  <<
    /AA
      <<
        /K 685 0 R
      >>
    /DA <92F8913CB200CF3C13A363C2D20D>
    /F 4
    /FT /Tx
    /Ff 12582912
    /MK
    /MaxLen 1
    /P 28 0 R
    /Parent 46 0 R
    /Q 1
    /Rect [454.437 769.504 465.482 782.169]
    /Subtype /Widget
    /T <8C8A>
    /Type /Annot
    /V ()
    /AP 686 0 R
  >>

And to get the objects referencing object 228:

$ python pdf-parser.py -r 228 ../../ccmig2.pdf

obj 28 0
 Type: /Page
 Referencing: 101 0 R, 217 0 R, 218 0 R, 219 0 R, 220 0 R, 221 0 R, 222 0
R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 R, 230 0 R,
231 0 R, 232 0 R, 61 0 R, 60 0 R, 62 0 R, 63 0 R, 64 0 R, 65 0 R, 66 0 R,
67 0 R, 69 0 R, 68 0 R, 70 0 R, 71 0 R, 72 0 R, 73 0 R, 74 0 R, 75 0 R, 76
0 R, 77 0 R, 78 0 R, 79 0 R, 80 0 R, 81 0 R, 82 0 R, 83 0 R, 84 0 R, 86 0
R, 87 0 R, 88 0 R, 89 0 R, 90 0 R, 91 0 R, 92 0 R, 93 0 R, 94 0 R, 95 0 R,
96 0 R, 97 0 R, 85 0 R, 233 0 R, 234 0 R, 235 0 R, 236 0 R, 237 0 R, 238 0
R, 239 0 R, 22 0 R, 240 0 R, 241 0 R, 242 0 R, 243 0 R, 244 0 R, 245 0 R,
246 0 R, 247 0 R, 103 0 R, 248 0 R, 6 0 R, 205 0 R, 206 0 R, 207 0 R, 208 0
R, 209 0 R, 210 0 R, 211 0 R, 213 0 R, 212 0 R

  <<
    /Annots '[101 0 R 217 0 R 218 0 R 219 0 R 220 0 R 221 0 R 222 0 R 223 0
R 224 0 R 225 0 R\n226 0 R 227 0 R 228 0 R 229 0 R 230 0 R 231 0 R 232 0 R
61 0 R 60 0 R 62 0 R\n63 0 R 64 0 R 65 0 R 66 0 R 67 0 R 69 0 R 68 0 R 70 0
R 71 0 R 72 0 R\n73 0 R 74 0 R 75 0 R 76 0 R 77 0 R 78 0 R 79 0 R 80 0 R 81
0 R 82 0 R\n83 0 R 84 0 R 86 0 R 87 0 R 88 0 R 89 0 R 90 0 R 91 0 R 92 0 R
93 0 R\n94 0 R 95 0 R 96 0 R 97 0 R 85 0 R 233 0 R 234 0 R 235 0 R 236 0 R
237 0 R\n238 0 R 239 0 R 22 0 R 240 0 R 241 0 R 242 0 R 243 0 R 244 0 R 245
0 R 246 0 R\n247 0 R 103 0 R]'
    /BleedBox [0.0 0.0 595.276 841.89]
    /Contents 248 0 R
    /CropBox [0.0 0.0 595.276 841.89]
    /MediaBox [0.0 0.0 595.276 841.89]
    /Parent 6 0 R
    /Resources
      <<
        /ExtGState
          <<
            /GS0 205 0 R
            /GS1 206 0 R
            /GS2 207 0 R
            /GS3 208 0 R
          >>
        /Font
          <<
            /C2_0 209 0 R
            /C2_1 210 0 R
            /TT0 211 0 R
            /TT1 213 0 R
            /TT2 212 0 R
          >>
        /ProcSet [/PDF /Text]
      >>
    /Rotate 0
    /Tabs /W
    /TrimBox [0.0 0.0 595.276 841.89]
    /Type /Page
  >>


obj 46 0
 Type:
 Referencing: 218 0 R, 230 0 R, 231 0 R, 232 0 R, 219 0 R, 217 0 R, 220 0
R, 221 0 R, 222 0 R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R,
229 0 R, 17 0 R

  <<
    /Kids '[218 0 R 230 0 R 231 0 R 232 0 R 219 0 R 217 0 R 220 0 R 221 0 R
222 0 R 223 0 R\n224 0 R 225 0 R 226 0 R 227 0 R 228 0 R 229 0 R]'
    /Parent 17 0 R
    /T <32AB37>
  >>
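
What the -r lookup does can be approximated in a few lines of plain Java, as
in the sketch below. It is only a rough stand-in: it assumes the objects are
stored uncompressed (no object streams) and that no stream data happens to
contain text that looks like an object header. The class name ReferenceScan
and the method referencing are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for "pdf-parser.py -r N" on uncompressed objects.
public class ReferenceScan {

    // Captures object number, generation and body of "N G obj ... endobj".
    private static final Pattern OBJ =
            Pattern.compile("(?s)\\b(\\d+)\\s+(\\d+)\\s+obj\\b(.*?)\\bendobj\\b");

    // Returns the numbers of all objects whose body contains "target G R".
    static List<Integer> referencing(String pdfText, int target) {
        // The lookbehind keeps "228 0 R" from matching a search for 28.
        Pattern ref = Pattern.compile("(?<![\\d.])" + target + "\\s+\\d+\\s+R\\b");
        List<Integer> result = new ArrayList<>();
        Matcher m = OBJ.matcher(pdfText);
        while (m.find()) {
            if (ref.matcher(m.group(3)).find()) {
                result.add(Integer.parseInt(m.group(1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String pdf =
                "28 0 obj\n<< /Type /Page /Annots [218 0 R 228 0 R] >>\nendobj\n"
              + "46 0 obj\n<< /Kids [218 0 R 228 0 R] /Parent 17 0 R >>\nendobj\n"
              + "228 0 obj\n<< /Type /Annot /P 28 0 R /Parent 46 0 R >>\nendobj\n";
        System.out.println(referencing(pdf, 228));
    }
}
```

For a real file, the text could be obtained by reading all bytes and decoding
them as ISO-8859-1, which maps every byte to one character and so keeps the
object headers intact.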

It would be tremendous if I could get at least the proper object id out of
the PDFields using PDFBox.

Take care
Roberto
