pdfbox-users mailing list archives

From Evan Williams <evan.willi...@zapprx.com>
Subject Re: Trouble With Dots In Field Names
Date Sat, 24 Sep 2016 19:04:56 GMT
I appreciate that, Olaf, thank you.

I observed exactly what you did about the structure of PDFs and the dot.

And I have been dealing with it by manually renaming the fields in Acrobat.

What I am trying to do is NOT manually rename them in Acrobat.

I am doing work that is already labor-intensive and wasteful of human
time, and I am specifically trying to automate away as much of it as
possible.

It is entirely possible to go into Acrobat and 'fix' it, but there are 100
things in every form that need to be 'fixed'. I would like to only have 99
problems.
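
For the naming half of that automation, here is a minimal sketch in plain Java. It does not call PDFBox at all: it only takes a list of fully qualified field names and proposes a dot-free replacement for each dotted one. The class and method names (FieldNameSanitizer, sanitize, renamePlan) and the choice of '_' as the replacement character are assumptions made for illustration. It also cannot tell an accidental dot ("W55.21") from an intended hierarchy ("sender.address.city"), and applying the plan to the actual field tree is a separate step not shown here.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FieldNameSanitizer {

    /** Replaces every period with an underscore so the name is a single segment. */
    static String sanitize(String name) {
        return name.replace('.', '_');
    }

    /**
     * Maps each dotted name to a dot-free replacement.
     * Replacements never collide with existing dot-free names or with each other.
     */
    static Map<String, String> renamePlan(List<String> names) {
        // Names that already contain no dot are kept as-is, so they are reserved.
        Set<String> taken = new HashSet<>();
        for (String name : names) {
            if (name.indexOf('.') < 0) {
                taken.add(name);
            }
        }

        Map<String, String> plan = new LinkedHashMap<>();
        for (String name : names) {
            if (name.indexOf('.') < 0) {
                continue; // nothing to fix
            }
            String base = sanitize(name);
            String candidate = base;
            int suffix = 2;
            // Set.add returns false when the name is already used; try the next suffix.
            while (!taken.add(candidate)) {
                candidate = base + "_" + suffix++;
            }
            plan.put(name, candidate);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> names = List.of("W55.21", "W55_21", "dose 0.5 mg", "patient");
        renamePlan(names).forEach((from, to) -> System.out.println(from + " -> " + to));
    }
}
```

The numeric suffix is there only so that a sanitized name never silently overwrites a field that already exists under that name.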

On Sat, Sep 24, 2016 at 2:56 PM, Olaf Drümmer <olaflist@callassoftware.com>
wrote:

> AFAIK the period serves as a delimiter for nodes and leaves in a tree.
>
> Example:
>
> sender.address.name.first
> sender.address.name.last
> sender.address.street.name
> sender.address.street.number
> sender.address.ZIP
> sender.address.city
> …
>
> The actual fields (that can contain some value) are the leaf items: first,
> last, name, number, ZIP, city
>
> To the best of my knowledge, if a field is named “W55.21” it is actually a
> leaf item “21” (that can have a value) inside a parent node “W55” (that
> can’t hold a value).
>
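
The reading described above can be sketched in plain Java, without PDFBox: split the fully qualified name on periods, treat the last segment as the leaf and everything before it as the chain of parent nodes. The class and method names (DottedFieldNames, segments, leaf, parents) are hypothetical, and this is only an illustration of the naming convention, not of how any particular library stores the tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DottedFieldNames {

    /** Splits a fully qualified field name into its path segments. */
    static List<String> segments(String fullyQualifiedName) {
        // The -1 limit keeps empty trailing segments, so "a." yields ["a", ""].
        return Arrays.asList(fullyQualifiedName.split("\\.", -1));
    }

    /** The last segment is the leaf, the only part that can hold a value. */
    static String leaf(String fullyQualifiedName) {
        List<String> parts = segments(fullyQualifiedName);
        return parts.get(parts.size() - 1);
    }

    /** Everything before the last segment names the chain of parent nodes. */
    static List<String> parents(String fullyQualifiedName) {
        List<String> parts = segments(fullyQualifiedName);
        return new ArrayList<>(parts.subList(0, parts.size() - 1));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"sender.address.name.first", "W55.21"}) {
            System.out.println(name + " -> parents " + parents(name) + ", leaf " + leaf(name));
        }
    }
}
```

For "W55.21" this gives the single parent "W55" and the leaf "21", which is why a name that a human reads as one code shows up as two nodes.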
> It looks like someone built AcroForm forms without understanding AcroForm
> forms.
>
> Not sure how to “fix” this by using PDFBox. Maybe you need to rename the
> fields into something that doesn’t use a period.
>
>
> Olaf
>
>
>
> > On 24 Sep 2016, at 17:13, Evan Williams <evan.williams@zapprx.com>
> > wrote:
> >
> > I have a problem, but I think it's non-terminal.
> >
> > I have been using PDFBox to work with forms for about a year and a half,
> > and I have a handle on many things, but I have a persistent and pernicious
> > issue with forms where fields have periods ('.') in their name.
> >
> > These forms are from external sources and are typically old school
> > AcroForms. Because of the nature of the forms (medical), their field names
> > often contain decimal values like '0.5 mg' or 'W55.21'. These forms do not
> > seem to have ever been meant to be read programmatically. They are for
> > human consumption.
> >
> > As far as I can tell, '.' is a magic character used by fully qualified
> > names that delineates elements of the path. So when I iterate over the
> > fields I get a bunch of name fragments as 'PDNonTerminalField's and regular
> > fields.
> >
> > My current way of dealing with this is to waste the time of a skilled
> > graphic designer, or my own time, manually going in and fixing it. This is
> > mostly just an annoyance. But annoyances add up. And I am trying to
> > automate as much as I possibly can in dealing with these forms.
> >
> > *Is there any obvious way to identify this corrupt situation and correct
> > it?*
> >
> > I wonder if I am just doing something wrong (I am iterating over the
> > fields in the time-honored way that the form example that is included with
> > PDFBox uses).
> >
> > Adobe Acrobat seems perfectly happy to deal with fields containing periods
> > (including, unfortunately, allowing people to create them). So there must
> > be some way to deal with this.
> >
> > Your advice would be of great service to me.
> >
> > Thank you.
> > --
> > *Evan Williams*
> > Sr. Software Engineer
> > evan.williams@zapprx.com
> >
> > *www.ZappRx.com <http://www.zapprx.com/>*
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>


-- 
*Evan Williams*
Sr. Software Engineer
evan.williams@zapprx.com

*www.ZappRx.com <http://www.zapprx.com/>*
