pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: How to extract the object id from a form field?
Date Wed, 19 Aug 2015 20:30:16 GMT
Hi,

> On 19.08.2015 at 21:16, Roberto Nibali <rnibali@gmail.com> wrote:
> 
> Hi Tilman
> 
> Thanks for your reply ... I did not really succeed. We'll probably end up looking at how the PDFDebugger code does it ;).
> 
> On Tue, Aug 18, 2015 at 9:08 PM, Tilman Hausherr <THausherr@t-online.de> wrote:
> On 18.08.2015 at 20:50, Roberto Nibali wrote:
>> Hi
>> 
>> I'd like to print out the corresponding object id given a specific form
>> field. How would I do that with PDFBox programmatically?
>> 
>> Let's for the sake of the argument, assume that the form field is
>> represented by the following obj:
>> 
>> obj 218 0
>>   <<
>>     /DA <2B94B0298F2FD7F81F32C6E22043>
>>     /F 4
>>     /FT /Tx
>>     /Ff 4194304
>>     /MK
>>     /P 28 0 R
>>     /Parent 46 0 R
>>     /Rect [159.781 764.53 347.142 777.195]
>>     /Subtype /Widget
>>     /T <5EB6B730886188AB3D3194B9654C18094C>
>>     /Type /Annot
>>     /V <45BBBA249C618BBD3974A4BE61501E57181D>
>>     /AP 666 0 R
>>   >>
>> 
>> If I am going over all PDField entries of a PDF, how would I get to the
>> underlying obj number (in the above case 218) from a PDField object?
> 
> I haven't tried this myself, but I think you could "synchronise" the getChildren() results with the getCOSObject().getItem(COSName.KIDS) array, i.e. work out which indirect object corresponds to which item returned from getChildren(). The Kids COSArray holds indirect objects (= COSObject type), as seen here:
> 
> 
> 
> COSObject.getObject() returns the dereferenced object.
> 
> The reason I asked about this is that while migrating some documents, we found out that the originating PDFs do not only differ in their text (mostly legal changes in the fixed text); in certain cases the client also modified the PDFs by adding borders or other graphical elements. Those obviously do not show up in the template PDF.
> 
> My (maybe stupid) idea was to simply print out the obj id or even the whole object and subsequently insert it into the template for the final PDF during the form field migration, on top of updating all references to the new obj id.
> 
> At least for simple geometric shapes, like rectangles, this should be feasible, no? Anyway, after constantly getting "null" from the getCOSObject().getItem(COSName.KIDS) and nothing out of getChildren() from a given PDField, I kind of gave up.
> 
> Imagine you had the following code, and wanted to additionally dump out the underlying object id and the referencing ids of the PDField:
> // Fragment of a JUnit test class; it needs java.io.File, java.io.IOException, java.util.List,
> // org.junit.Test, org.apache.pdfbox.pdmodel.PDDocument and, from
> // org.apache.pdfbox.pdmodel.interactive.form, PDAcroForm, PDField, PDNonTerminalField, PDSignatureField.
> @Test
> public void executeDumpFields() throws IOException {
>     PDDocument srcDoc = null;
>     try {
>         srcDoc = PDDocument.load(new File(srcDocName));
>         PDAcroForm acroForm = srcDoc.getDocumentCatalog().getAcroForm();
>         List<PDField> fields = acroForm.getFields();
>         for (PDField field : fields) {
>             dumpField(srcDoc, field);
>         }
>     } catch (Exception e) {
>         logerr(e.getMessage());
>     } finally {
>         // Close exactly once, whether or not the dump succeeded.
>         if (srcDoc != null) {
>             srcDoc.close();
>         }
>     }
> }
> 
> private void dumpField(PDDocument srcDoc, PDField srcField) throws IOException {
>     if (srcField instanceof PDNonTerminalField) {
>         // Non-terminal fields only group other fields, so recurse into the children.
>         for (PDField child : ((PDNonTerminalField) srcField).getChildren()) {
>             dumpField(srcDoc, child);
>         }
>     } else if (!(srcField instanceof PDSignatureField)) {
>         // Terminal field: print its fully qualified name and its class name without the package.
>         String fqName = srcField.getFullyQualifiedName();
>         System.out.printf("fqName=%s type=%s%n", fqName, srcField.getClass().getSimpleName());
>     }
> }
> It has become customary for me to dump the objects using pdf-parser (http://blog.didierstevens.com/programs/pdf-tools/) as follows to further investigate issues (excerpt showing the dump of object 228):
> 
> $ python pdf-parser.py -o 228 ../../ccmig2.pdf
> 
> obj 228 0
>  Type: /Annot
>  Referencing: 685 0 R, 28 0 R, 46 0 R, 686 0 R
> 
>   <<
>     /AA
>       <<
>         /K 685 0 R
>       >>
>     /DA <92F8913CB200CF3C13A363C2D20D>
>     /F 4
>     /FT /Tx
>     /Ff 12582912
>     /MK
>     /MaxLen 1
>     /P 28 0 R
>     /Parent 46 0 R
>     /Q 1
>     /Rect [454.437 769.504 465.482 782.169]
>     /Subtype /Widget
>     /T <8C8A>
>     /Type /Annot
>     /V ()
>     /AP 686 0 R
>   >>
> 
> And to get the objects referencing object 228:
> 
> $ python pdf-parser.py -r 228 ../../ccmig2.pdf
> 
> obj 28 0
>  Type: /Page
>  Referencing: 101 0 R, 217 0 R, 218 0 R, 219 0 R, 220 0 R, 221 0 R, 222 0 R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 R, 230 0 R, 231 0 R, 232 0 R, 61 0 R, 60 0 R, 62 0 R, 63 0 R, 64 0 R, 65 0 R, 66 0 R, 67 0 R, 69 0 R, 68 0 R, 70 0 R, 71 0 R, 72 0 R, 73 0 R, 74 0 R, 75 0 R, 76 0 R, 77 0 R, 78 0 R, 79 0 R, 80 0 R, 81 0 R, 82 0 R, 83 0 R, 84 0 R, 86 0 R, 87 0 R, 88 0 R, 89 0 R, 90 0 R, 91 0 R, 92 0 R, 93 0 R, 94 0 R, 95 0 R, 96 0 R, 97 0 R, 85 0 R, 233 0 R, 234 0 R, 235 0 R, 236 0 R, 237 0 R, 238 0 R, 239 0 R, 22 0 R, 240 0 R, 241 0 R, 242 0 R, 243 0 R, 244 0 R, 245 0 R, 246 0 R, 247 0 R, 103 0 R, 248 0 R, 6 0 R, 205 0 R, 206 0 R, 207 0 R, 208 0 R, 209 0 R, 210 0 R, 211 0 R, 213 0 R, 212 0 R
> 
>   <<
>     /Annots '[101 0 R 217 0 R 218 0 R 219 0 R 220 0 R 221 0 R 222 0 R 223 0 R 224 0 R 225 0 R\n226 0 R 227 0 R 228 0 R 229 0 R 230 0 R 231 0 R 232 0 R 61 0 R 60 0 R 62 0 R\n63 0 R 64 0 R 65 0 R 66 0 R 67 0 R 69 0 R 68 0 R 70 0 R 71 0 R 72 0 R\n73 0 R 74 0 R 75 0 R 76 0 R 77 0 R 78 0 R 79 0 R 80 0 R 81 0 R 82 0 R\n83 0 R 84 0 R 86 0 R 87 0 R 88 0 R 89 0 R 90 0 R 91 0 R 92 0 R 93 0 R\n94 0 R 95 0 R 96 0 R 97 0 R 85 0 R 233 0 R 234 0 R 235 0 R 236 0 R 237 0 R\n238 0 R 239 0 R 22 0 R 240 0 R 241 0 R 242 0 R 243 0 R 244 0 R 245 0 R 246 0 R\n247 0 R 103 0 R]'
>     /BleedBox [0.0 0.0 595.276 841.89]
>     /Contents 248 0 R
>     /CropBox [0.0 0.0 595.276 841.89]
>     /MediaBox [0.0 0.0 595.276 841.89]
>     /Parent 6 0 R
>     /Resources
>       <<
>         /ExtGState
>           <<
>             /GS0 205 0 R
>             /GS1 206 0 R
>             /GS2 207 0 R
>             /GS3 208 0 R
>           >>
>         /Font
>           <<
>             /C2_0 209 0 R
>             /C2_1 210 0 R
>             /TT0 211 0 R
>             /TT1 213 0 R
>             /TT2 212 0 R
>           >>
>         /ProcSet [/PDF /Text]
>       >>
>     /Rotate 0
>     /Tabs /W
>     /TrimBox [0.0 0.0 595.276 841.89]
>     /Type /Page
>   >>
> 
> 
> obj 46 0
>  Type:
>  Referencing: 218 0 R, 230 0 R, 231 0 R, 232 0 R, 219 0 R, 217 0 R, 220 0 R, 221 0 R, 222 0 R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 R, 17 0 R
> 
>   <<
>     /Kids '[218 0 R 230 0 R 231 0 R 232 0 R 219 0 R 217 0 R 220 0 R 221 0 R 222 0 R 223 0 R\n224 0 R 225 0 R 226 0 R 227 0 R 228 0 R 229 0 R]'
>     /Parent 17 0 R
>     /T <32AB37>
>   >>
> 
> It would be tremendous if I could get at least the proper object id out of the PDFields using PDFBox.

A PDField is uniquely identified by its fully qualified name, which can also be used to find
it within the template. If someone added a border to a field in the source document and you
would like to carry it over to the corresponding template field, that border is part of the
widget definition for the field, e.g. the /MK entry. Acrobat also applies some defaults: for
example, when a border color is defined, a small border is drawn around the field even if no
border width is defined.
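
To illustrate matching by name rather than by object number, here is a minimal sketch that deliberately does not use PDFBox. Each field is modelled as a plain map from dictionary key to value, and the class name, method name and sample field name are all hypothetical. With PDFBox itself, the lookup by name would presumably go through PDAcroForm.getField(String) and the /MK entry would be read from the field's widget. Treat that as an assumption to check against the version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// PDFBox-free model: a field is a plain map of dictionary keys to values,
// and each document is a map from fully qualified field name to field.
final class BorderMigration {

    /**
     * Copies the "MK" entry of every source field to the template field with
     * the same fully qualified name. Returns the number of fields updated.
     */
    static int copyAppearanceCharacteristics(
            Map<String, Map<String, Object>> sourceFields,
            Map<String, Map<String, Object>> templateFields) {
        int updated = 0;
        for (Map.Entry<String, Map<String, Object>> source : sourceFields.entrySet()) {
            Map<String, Object> target = templateFields.get(source.getKey());
            Object mk = source.getValue().get("MK");
            // Skip fields missing from the template and fields without /MK.
            if (target != null && mk != null) {
                target.put("MK", mk);
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, Object> sourceField = new LinkedHashMap<>();
        sourceField.put("MK", "hypothetical border settings");
        Map<String, Map<String, Object>> source = new LinkedHashMap<>();
        source.put("form.customer.name", sourceField);

        Map<String, Map<String, Object>> template = new LinkedHashMap<>();
        template.put("form.customer.name", new LinkedHashMap<String, Object>());

        System.out.println(copyAppearanceCharacteristics(source, template) + " field(s) updated");
    }
}
```

Because the key is the field name, the sketch never needs to know which object number a field has in either file.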

If I understood your use case correctly, knowing the object id of the field wouldn't help in
this case.
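
For anyone who still wants the object number: in the dumps above, obj 218 has no /Kids entry of its own, which would explain the "null" from getItem(COSName.KIDS) on that field. The reference "218 0 R" lives in the /Kids array of its parent, obj 46. The matching Tilman suggested can be modelled without PDFBox as below. IndirectRef and KidsLookup are hypothetical stand-ins, not PDFBox types. In PDFBox 2.0 the raw entries would, as far as I know, come from COSArray.get(int) (getObject(int) dereferences them) and the number from COSObject.getObjectNumber().

```java
import java.util.Arrays;
import java.util.List;

// PDFBox-free model of matching a dereferenced field dictionary against the
// indirect references stored in its parent's /Kids array.
final class KidsLookup {

    /** Hypothetical stand-in for an indirect reference such as "218 0 R". */
    static final class IndirectRef {
        final long objectNumber;
        final int generation;
        final Object target; // the dictionary the reference points to

        IndirectRef(long objectNumber, int generation, Object target) {
            this.objectNumber = objectNumber;
            this.generation = generation;
            this.target = target;
        }
    }

    /**
     * Returns the object number of the Kids entry whose target is the very
     * same instance as fieldDictionary, or -1 if no entry points to it.
     */
    static long objectNumberOf(Object fieldDictionary, List<IndirectRef> parentKids) {
        for (IndirectRef ref : parentKids) {
            // Identity comparison: two distinct dictionaries may have equal content.
            if (ref.target == fieldDictionary) {
                return ref.objectNumber;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Object field218 = new Object();
        Object field230 = new Object();
        List<IndirectRef> kidsOfObj46 = Arrays.asList(
                new IndirectRef(218, 0, field218),
                new IndirectRef(230, 0, field230));

        System.out.println(objectNumberOf(field230, kidsOfObj46)); // prints 230
    }
}
```

The comparison is by identity on purpose, since sibling fields can carry identical entries and an equality check could then return the wrong number.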

BR
Maruan


> 
> Take care
> Roberto
> 
> 

