pdfbox-users mailing list archives

From Roberto Nibali <rnib...@gmail.com>
Subject Get content of a specific object
Date Wed, 26 Aug 2015 22:39:26 GMT
Hi

I'm looking at a PDF in PDFDebugger, and the text I'd like to extract
is inside the content stream at node Root/Pages/Kids/[0]/Contents.
How do I programmatically navigate down to this node and extract the
Flate-decoded ASCII stream stored in object [3 0 R]?

The first bytes of the stream's content look as follows in PDFDebugger:

q
  1 1 1 rg
  /a0 gs
  14.16 827.76 566.879 -824.879 re
  f
  BT
    9.9984 0 0 9.9984 70.8 806.64 Tm
    /f-0-0 1 Tf
    [ ($) 6 ($) 6 (Do) 6 (s) 20 (s) 20 (i) -17 (er) -7 (n) 6 (r=) -16 (3)
      26 ('3) 6 (9) 6 (4'5) 6 (98) ] TJ

Any pointers would be most welcome. In the above example, I'd like to
extract the text "$$DossierNr".
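
For illustration, the TJ operand in the excerpt is an array of literal
strings interleaved with numeric kerning adjustments, and the shown text
is the concatenation of those strings. The following is a minimal sketch
in plain Java of that concatenation only. It makes no PDFBox calls, the
names TjArrayText and textOf are hypothetical, and it assumes the font
maps the string bytes to the same ASCII characters, which may not hold
for subset fonts.

```java
// Hypothetical helper, not part of PDFBox: joins the "( ... )" literal
// strings of a TJ array and ignores the kerning numbers between them.
final class TjArrayText {

    static String textOf(String tjArray) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // parenthesis nesting level; 0 means outside any string
        for (int i = 0; i < tjArray.length(); i++) {
            char c = tjArray.charAt(i);
            if (depth == 0) {
                // Outside a string: skip brackets, numbers, and the operator.
                if (c == '(') {
                    depth = 1;
                }
                continue;
            }
            if (c == '\\' && i + 1 < tjArray.length()) {
                // Simplification: the character after a backslash is copied
                // literally, which is right for \( \) and \\ but not for
                // escapes such as \n or octal codes.
                out.append(tjArray.charAt(++i));
            } else if (c == '(') {
                depth++;          // balanced nested parenthesis is string data
                out.append(c);
            } else if (c == ')') {
                depth--;
                if (depth > 0) {
                    out.append(c); // closes a nested parenthesis, not the string
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String tj = "[ ($) 6 ($) 6 (Do) 6 (s) 20 (s) 20 (i) -17 (er) -7 (n) 6 "
                  + "(r=) -16 (3) 26 ('3) 6 (9) 6 (4'5) 6 (98) ] TJ";
        System.out.println(textOf(tj));
    }
}
```

For the array above this prints $$Dossiernr=3'394'598; the label before
the '=' is the part of interest.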

As a side note: a wonderful enhancement to PDFDebugger would be to
generate working PDFBox code for a given node upon right-clicking it
in the left-hand pane.

Cheers
Roberto
