pdfbox-users mailing list archives

From Roberto Nibali <rnib...@gmail.com>
Subject Re: Sanitizing input?
Date Mon, 10 Aug 2015 00:22:47 GMT

Disclaimer: I have very limited knowledge of the PDF standard and only
command the basics of PDFBox. However, I have had my share of thrills with
the IRS.

What's the final purpose of those filled out PDFs? Do you intend to be MeF (
http://www.irs.gov/pub/irs-pdf/p4164.pdf) compliant? Are we talking about
such PDFs (http://www.irs.gov/pub/irs-pdf/, which btw are/could be quite a
test bed for PDFBox)? If MeF sounds intriguing to you, "simply" model and
validate the input with the IRS' XSD for MeF and build your application
around such stable data governance.
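
To make the XSD idea a bit more concrete, here is a minimal sketch using only
the JDK's javax.xml.validation API. The tiny inline schema, the element names
(filer, name, ssn) and the class name XsdInputCheck are made up for
illustration; the real MeF schemas are much larger and would be loaded from
the files the IRS publishes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class XsdInputCheck {
    // Hypothetical stand-in schema, NOT an actual IRS/MeF schema.
    static final String DEMO_XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"filer\"><xs:complexType><xs:sequence>"
      + "<xs:element name=\"name\"><xs:simpleType>"
      + "<xs:restriction base=\"xs:string\">"
      + "<xs:maxLength value=\"35\"/><xs:pattern value=\"[A-Za-z ]+\"/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "<xs:element name=\"ssn\"><xs:simpleType>"
      + "<xs:restriction base=\"xs:string\">"
      + "<xs:pattern value=\"[0-9]{9}\"/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the XML document conforms to DEMO_XSD. */
    static boolean isValid(String xml) {
        Validator validator;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema =
                factory.newSchema(new StreamSource(new StringReader(DEMO_XSD)));
            validator = schema.newValidator();
            // The input is untrusted, so refuse external DTDs and schemas.
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        } catch (SAXException e) {
            // A broken schema or unsupported property is a setup bug,
            // not a problem with the user's input.
            throw new IllegalStateException("schema setup failed", e);
        }
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed XML or schema violation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of this design is that anything the schema rejects never reaches the
PDF-filling step at all, so the form code only ever sees values that already
match the expected shape.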

Generally the IRS does extensive post-processing on the input documents, so
I wouldn't bother too much. But depending on the kind of service you offer,
your mileage will vary. Now, we had our share of fun with the IRS when
filling out claims from "untrusted sources". If you provide a certified tax
service, you might also need to adhere to processing standards set forth by
the NIST, as in NIST SP-800-xx (53, for example), outlined for agencies in

If you're just doing it for some friends, apply the basic sanitizing
aspects you figure out and go from there. Improve it over time, depending
on the feedback of the IRS' process.
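
As a starting point for such basic sanitizing, something along these lines
might do. It is only a sketch under my own assumptions: the class name
FieldSanitizer, the choice of what to strip and the length limit are all
hypothetical, and nothing here comes from PDFBox or from IRS requirements.
It normalizes the text, drops control and invisible formatting characters,
collapses whitespace and caps the length before the value goes anywhere near
a form field.

```java
import java.text.Normalizer;

final class FieldSanitizer {
    /**
     * Cleans one untrusted value for a single-line form field.
     * maxLength is counted in Unicode code points.
     */
    static String sanitize(String raw, int maxLength) {
        if (raw == null) {
            return "";
        }
        // Use one canonical form so visually identical input compares equal.
        String normalized = Normalizer.normalize(raw, Normalizer.Form.NFC);

        StringBuilder out = new StringBuilder();
        normalized.codePoints().forEach(cp -> {
            if (cp == '\n' || cp == '\r' || cp == '\t') {
                out.append(' '); // single-line field: breaks become spaces
                return;
            }
            int type = Character.getType(cp);
            boolean drop = type == Character.CONTROL
                || type == Character.FORMAT       // e.g. bidi overrides
                || type == Character.SURROGATE    // unpaired surrogates
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED;
            if (!drop) {
                out.appendCodePoint(cp);
            }
        });

        // Trim and collapse runs of spaces left behind by the steps above.
        String cleaned = out.toString().trim().replaceAll(" {2,}", " ");

        // Truncate on a code point boundary, never in the middle of a pair.
        if (cleaned.codePointCount(0, cleaned.length()) > maxLength) {
            cleaned = cleaned
                .substring(0, cleaned.offsetByCodePoints(0, maxLength))
                .trim();
        }
        return cleaned;
    }
}
```

For example, FieldSanitizer.sanitize("  Jane\u0000 Doe\r\n", 35) gives
"Jane Doe". Whether silently truncating is acceptable, or whether the input
should be rejected instead, depends on the form and is left open here.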

Best regards


On Sun, Aug 9, 2015 at 11:10 PM, Stuart Small <stuart.alan.small@gmail.com> wrote:

> I am putting together a system that automatically generates some tax forms
> off of user input.  The original PDFs are provided by the IRS, I will just
> be plugging user input into relevant fields.
> PDF is a large file format that I don't fully understand.  I've been
> surprised before by some of the things it is capable of.  So that got me
> thinking, is there any sanitization I need to perform on the user input
> before generating the PDF?  Or any special cases I should keep in mind when
> filling in forms with arbitrary strings from an untrusted source?
> Thanks in advance!
