pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: extracting checkboxes in non acroform pdf
Date Thu, 29 Nov 2018 22:17:58 GMT

> On 29.11.2018 at 20:56, Tilman Hausherr <THausherr@t-online.de> wrote:
> 
> On 29.11.2018 at 09:49, Nicolas Paris wrote:
>> Hi
>> 
>>> It could be an XFA forms pdf... then you'd have to analyze the XML content.
>> I opened the PDF in a text editor, and I can say the boxes are in an
>> XML stream entity, in binary format. (By removing some of the binary, I
>> was able to remove the boxes.)
>> Does it exclude the XFA form pdf nature ?
> 
> 
> Sorry, "nature" looks like a bad translation, and sadly I don't know what you meant...
> Please write that part in French, which I understand too.
> 
> PDFBox doesn't have an API for the XFA form.

That's not completely correct. If the PDF contains an XFA form, acroForm.getXFA().getDocument()
(where acroForm is the document's PDAcroForm) returns the XFA as an XML Document object, and
acroForm.getXFA().getBytes() returns the raw (XML) content. From there you are on your own
and need to process the XML yourself.
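
As a rough illustration of that last step, here is a minimal sketch that uses only
the JDK's XML classes. It assumes you already have the XFA bytes (for example from
getXFA().getBytes()) and that checkboxes appear as "field" elements containing a
"checkButton" element, with the value in a "value" child. The class name
XfaCheckboxes and the method extract are made up for this example and are not part
of PDFBox. In many XFA files the current state may live in the datasets packet
rather than in the template, so treat the result as a starting point only.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a PDFBox class): lists checkbox-like fields in XFA XML.
class XfaCheckboxes {

    // xfaBytes: the XFA XML, e.g. the byte[] obtained from getXFA().getBytes().
    // Returns field name -> text of the first <value> element, in document order.
    static Map<String, String> extract(byte[] xfaBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XFA packets are namespaced
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xfaBytes));

            Map<String, String> result = new LinkedHashMap<>();
            // "*" matches any namespace, so the XFA template version does not matter.
            NodeList fields = doc.getElementsByTagNameNS("*", "field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                // Assumption: a checkbox field has a <checkButton> somewhere below it.
                if (field.getElementsByTagNameNS("*", "checkButton").getLength() == 0) {
                    continue;
                }
                NodeList values = field.getElementsByTagNameNS("*", "value");
                String value = values.getLength() > 0
                        ? values.item(0).getTextContent().trim()
                        : "";
                result.put(field.getAttribute("name"), value);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XFA XML", e);
        }
    }
}
```

Called as XfaCheckboxes.extract(acroForm.getXFA().getBytes()), this would give you
a map of checkbox field names to their template values, which you can then compare
against whatever on/off values the form defines.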

BR
Maruan 

> 
> You can also upload the PDF to a sharehoster (no mail attachments). Or look at the PDF
> in PDFDebugger.
> 
> 
>> 
>>> It could be ordinary text, then the text stripper would do the job.
>> The regular textstripper does not extract them. Does it exclude the text
>> nature ?
> 
> 
> Same problem with "nature". PDFBox cannot extract XFA forms. It can detect glyphs that
> are used for forms, e.g. squares.
> 
> Tilman
> 
> 
>> 
>> Thanks a lot
>> 
>> On Thu, Nov 29, 2018 at 08:04:51AM +0100, Tilman Hausherr wrote:
>>> It could be an XFA forms pdf... then you'd have to analyze the XML content.
>>> 
>>> It could be widget annotations without an AcroForm, then you'd have to analyse
>>> these.
>>> 
>>> It could be ordinary text, then the text stripper would do the job.
>>> 
>>> It could be vector graphics, then it gets really difficult.
>>> 
>>> Tilman
>>> 
>>> On 28.11.2018 at 23:05, Nicolas Paris wrote:
>>>> Hi
>>>> 
>>>> I have several PDFs created with PDFCreator 2.0.1.0 and I want to extract
>>>> their content as text, including the checkbox values in them.
>>>> 
>>>> The PDF looks like a regular form PDF with checkboxes. However, it is not
>>>> an AcroForm-based PDF, and the regular PDFBox code I use in this case
>>>> does not apply: the acroform is null!
>>>> 
>>>> I wonder how I can iterate on those checkboxes (or visually equivalent)
>>>> objects or symbols.
>>>> 
>>>> If someone can give me a starter to list all objects in that pdf, that
>>>> might be helpful to begin with.
>>>> 
>>>> Thanks in advance,
>>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>> 
> 
> 