pdfbox-users mailing list archives

From Nicolas Paris <nicolas.pa...@riseup.net>
Subject Re: extracting checkboxes in non acroform pdf
Date Thu, 29 Nov 2018 20:27:00 GMT
On Thu, Nov 29, 2018 at 08:56:59PM +0100, Tilman Hausherr wrote:
> Am 29.11.2018 um 09:49 schrieb Nicolas Paris:
> > Hi
> > 
> > > It could be an XFA forms pdf... then you'd have to analyze the XML content.
> > I opened the pdf in a text editor, and I can say the boxes are in a
> > stream xml entity, in binary format. (By removing some binary, I have
> > been able to remove the boxes.)
> > Does it exclude the XFA form pdf nature ?
> 
> 
> Sorry, "nature" looks like a bad translation, and sadly I don't know what
> you meant...  please write that part in french, which I understand too.

I meant, "does the above information prove it is *not* an XFA form?". I
mean, the boxes aren't in XML but in the binary part.


> 
> PDFBox doesn't have an API for the XFA form.
> 
> You can also upload the PDF to a sharehoster (no mail attachments). Or look
> at the PDF in PDFDebugger.

I cannot share any copy of the PDF. Thanks for the suggestion, which
would help a lot.

> > 
> > > It could be ordinary text, then the text stripper would do the job.
> > The regular textstripper does not extract them. Does it exclude the text
> > nature ?
> 
> 
> Same problem with "nature". PDFBox cannot extract XFA forms. It can detect
> glyphs that are used for forms, e.g. squares.

I meant, "if the built-in PDFBox text stripper does not extract the
checkboxes, does that prove they are not ordinary text?"



How can I determine which kind of checkbox I have? Is there a way to
list all the objects within the PDF?
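
For the second question, a rough first pass needs no PDF library at all.
The sketch below is only a heuristic and is not PDFBox API: it reads the
file as raw bytes and counts a few PDF name tokens that hint at which of
the cases discussed in this thread applies. The class name PdfMarkerScan
and the choice of markers are assumptions made for illustration. Anything
stored inside a compressed stream is invisible to it, so a zero count
proves nothing, and a nonzero /ObjStm count means some objects are
compressed and hidden from this scan.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: counts raw PDF name tokens.
class PdfMarkerScan {

    // /AcroForm and /XFA hint at forms, /Widget and /Btn at widget
    // annotations and button fields, /ObjStm at compressed object streams.
    static final String[] MARKERS = {
        "/AcroForm", "/XFA", "/Widget", "/Btn", "/ObjStm"
    };

    // Count non-overlapping occurrences of each marker in the raw bytes.
    static Map<String, Integer> countMarkers(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary
        // stream data does not break the string conversion.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String marker : MARKERS) {
            int n = 0;
            int i = raw.indexOf(marker);
            while (i >= 0) {
                n++;
                i = raw.indexOf(marker, i + marker.length());
            }
            counts.put(marker, n);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfMarkerScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        countMarkers(bytes).forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```

A nonzero /Widget or /Btn count would point towards widget annotations
without an AcroForm, and /XFA towards an XFA form. If everything is zero,
the boxes are more likely text glyphs or vector graphics, or they sit in
compressed streams, and PDFDebugger is the better tool for looking further.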


> > 
> > On Thu, Nov 29, 2018 at 08:04:51AM +0100, Tilman Hausherr wrote:
> > > It could be an XFA forms pdf... then you'd have to analyze the XML content.
> > > 
> > > It could be widget annotations without acroform, then you'd have to analyse
> > > these.
> > > 
> > > It could be ordinary text, then the text stripper would do the job.
> > > 
> > > It could be vector graphics, then it gets really difficult.
> > > 
> > > Tilman
> > > 
> > > Am 28.11.2018 um 23:05 schrieb Nicolas Paris:
> > > > Hi
> > > > 
> > > > I have several PDFs created with PDFCreator 2.0.1.0 and I want to extract
> > > > the content as text, including the values of the checkboxes in them.
> > > > 
> > > > The PDF looks like a regular form PDF with checkboxes. However, it is not
> > > > an AcroForm-based PDF, and the regular PDFBox code I use in this case
> > > > does not apply: the acroform is null!
> > > > 
> > > > I wonder how I can iterate on those checkboxes (or visually equivalent)
> > > > objects or symbols.
> > > > 
> > > > If someone can give me a starter to list all objects in that pdf, that
> > > > might be helpful to begin with.
> > > > 
> > > > Thanks in advance,
> > > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> > > For additional commands, e-mail: users-help@pdfbox.apache.org
> > > 
> 
> 

-- 
nicolas
