pdfbox-users mailing list archives

From Dick Martin <rtmar...@nycap.rr.com>
Subject Re: extracting checkboxes in non acroform pdf
Date Thu, 29 Nov 2018 21:32:46 GMT
Yes. There is a way to "list all the objects within the pdf".
That's what Tilman meant when he said "Or look at the PDF in PDFDebugger."
The PDFDebugger is a utility included in the PDFBox download (or maybe
separately downloadable?)
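
If you want a rough first check before opening PDFDebugger, the sketch
below scans the raw bytes of the file for a few PDF name tokens
(/AcroForm, /XFA, /Widget, /Btn). It is only a heuristic and rests on an
assumption: the dictionaries must be stored uncompressed, because names
inside compressed object streams will not be found this way. The class and
method names (PdfMarkerScan, countToken) are made up for illustration and
are not part of PDFBox.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a PDFBox class: counts PDF name tokens in raw bytes.
class PdfMarkerScan {

    // Count non-overlapping occurrences of an ASCII token in the byte array.
    static int countToken(byte[] data, String token) {
        byte[] needle = token.getBytes(StandardCharsets.US_ASCII);
        if (needle.length == 0) {
            return 0; // nothing to search for
        }
        int count = 0;
        for (int i = 0; i + needle.length <= data.length; i++) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                count++;
                i += needle.length - 1; // skip past this match
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path to the PDF to inspect.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        // /AcroForm and /XFA hint at form structures, /Widget at widget
        // annotations, /Btn at button-type fields (checkboxes are buttons).
        String[] tokens = {"/AcroForm", "/XFA", "/Widget", "/Btn"};
        for (String token : tokens) {
            System.out.println(token + ": " + countToken(data, token));
        }
    }
}
```

After compiling, run it as `java PdfMarkerScan yourfile.pdf`. A nonzero
/Widget count with no /AcroForm would point towards Tilman's "widgets
annotations without acroform" case. All zeros could mean ordinary text or
vector graphics, but it could equally mean the objects are compressed, so
PDFDebugger remains the more reliable way to look.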

On Thu, Nov 29, 2018 at 3:27 PM Nicolas Paris <nicolas.paris@riseup.net>
wrote:

> On Thu, Nov 29, 2018 at 08:56:59PM +0100, Tilman Hausherr wrote:
> > Am 29.11.2018 um 09:49 schrieb Nicolas Paris:
> > > Hi
> > >
> > > > It could be an XFA forms pdf... then you'd have to analyze the XML
> > > > content.
> > > I opened the pdf in a text editor, and I can say the boxes are in a
> > > stream xml entity, in binary format. (By removing some binary, I have
> > > been able to remove the boxes.)
> > > Does it exclude the XFA form pdf nature ?
> >
> >
> > Sorry, "nature" looks like a bad translation, and sadly I don't know what
> > you meant...  please write that part in French, which I understand too.
>
> I meant, "does the above information prove it is *not* an XFA form?". I
> mean, the boxes aren't in xml but in the binary part.
>
>
> >
> > PDFBox doesn't have an API for the XFA form.
> >
> > You can also upload the PDF to a sharehoster (no mail attachments). Or
> > look at the PDF in PDFDebugger.
>
> I cannot share any copy of the pdf. Thanks for the suggestion, which
> would help a lot.
>
> > >
> > > > It could be ordinary text, then the text stripper would do the job.
> > > The regular textstripper does not extract them. Does it exclude the
> > > text nature ?
> >
> >
> > Same problem with "nature". PDFBox cannot extract XFA forms. It can
> > detect glyphs that are used for forms, e.g. squares.
>
> I meant, "if the built-in pdfbox text stripper does not extract the
> check-boxes, does it prove that they are not ordinary text?"
>
>
>
> How could I determine the kind of checkbox I have? Is there a way to
> list all the objects within the pdf?
>
>
> > >
> > > On Thu, Nov 29, 2018 at 08:04:51AM +0100, Tilman Hausherr wrote:
> > > > It could be an XFA forms pdf... then you'd have to analyze the XML
> > > > content.
> > > >
> > > > It could be widget annotations without acroform, then you'd have to
> > > > analyse these.
> > > >
> > > > It could be ordinary text, then the text stripper would do the job.
> > > >
> > > > It could be vector graphics, then it gets really difficult.
> > > >
> > > > Tilman
> > > >
> > > > Am 28.11.2018 um 23:05 schrieb Nicolas Paris:
> > > > > Hi
> > > > >
> > > > > I have several PDFs created with PDFCreator 2.0.1.0 and I want to
> > > > > extract the content as text, including the checkbox values in it.
> > > > >
> > > > > The pdf looks like a regular form pdf with checkboxes. However, it
> > > > > is not an AcroForm-based pdf, and the regular pdfbox code I use in
> > > > > this case does not apply: the acroform is null!
> > > > >
> > > > > I wonder how I can iterate on those checkboxes (or visually
> > > > > equivalent objects or symbols).
> > > > >
> > > > > If someone can give me a starter to list all objects in that pdf,
> > > > > that might be helpful to begin with.
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> > > > For additional commands, e-mail: users-help@pdfbox.apache.org
> > > >
> >
> >
>
> --
> nicolas
>
