pdfbox-dev mailing list archives

From Jeremy Villalobos <jeremyvillalo...@gmail.com>
Subject Pdfbox
Date Sun, 10 Jul 2011 19:58:02 GMT

I am the developer of pdf-to-speech.  The app uses a couple of engines to
get the text out of PDFs.  One is Google Docs, the other is Pdfbox.

I first want to thank you for your contribution.

I have the problem that the parser in pdfbox goes through the whole file
instead of loading only the parts it needs.  This is not even noticeable on a
fast desktop, but it becomes apparent on a 400 MHz ARM chip.

I want to improve the parser and offer those changes to the foundation
(adding them to pdfbox 1.4).  I should note that pdf-to-speech, as a whole, is
proprietary, although I released an alpha version of the app at
http://code.google.com/p/pdftospeech/ under the Apache License.

Would you be so kind as to point me to your favourite websites with information
about implementing the PDF format?  Any general advice would be of help as
well.  I have read through the parser code several times, but need to set up
a better testing and development environment before I continue to work on
My preliminary plan is to add the PDFParser object to the PDDocument, and
add algorithms that will dynamically load the pdf objects from the file
directly as they are needed.  Thanks for any help.
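
As a rough illustration of the plan above, here is a minimal sketch of on-demand
object loading.  It is not PDFBox code: the class LazyObjectStore, its methods,
and the in-memory byte array standing in for the file are all hypothetical, and
it assumes the byte offset of each object is already known (for example from
the cross-reference table).  It also assumes the bytes "endobj" never occur
inside an object body, which does not hold for real PDFs with binary streams.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not PDFBox API: parses an object only when it is first requested. */
final class LazyObjectStore {
    private final byte[] file;                     // stands in for a random-access file
    private final Map<Integer, Integer> offsets;   // object number -> byte offset (assumed known)
    private final Map<Integer, String> cache = new HashMap<>();
    private int parsedCount = 0;                   // how many objects were actually read

    LazyObjectStore(byte[] file, Map<Integer, Integer> offsets) {
        this.file = file;
        this.offsets = offsets;
    }

    /** Returns the raw text of one object, reading it from the file only on first access. */
    String get(int objectNumber) {
        String cached = cache.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Integer start = offsets.get(objectNumber);
        if (start == null) {
            throw new IllegalArgumentException("No offset known for object " + objectNumber);
        }
        String body = readUntilEndobj(start);
        cache.put(objectNumber, body);
        parsedCount++;
        return body;
    }

    int parsedCount() {
        return parsedCount;
    }

    /** Scans forward from the given offset to the first "endobj" keyword (simplifying assumption). */
    private String readUntilEndobj(int start) {
        byte[] marker = "endobj".getBytes(StandardCharsets.US_ASCII);
        for (int i = start; i + marker.length <= file.length; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (file[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                int length = i + marker.length - start;
                return new String(file, start, length, StandardCharsets.ISO_8859_1);
            }
        }
        throw new IllegalStateException("Missing endobj after offset " + start);
    }

    /** Builds a tiny in-memory "file" of three objects and records where each one starts. */
    static LazyObjectStore demo() {
        String[] bodies = {
            "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
            "2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n",
            "3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
        };
        StringBuilder sb = new StringBuilder();
        Map<Integer, Integer> offsets = new HashMap<>();
        for (int i = 0; i < bodies.length; i++) {
            offsets.put(i + 1, sb.length());   // ASCII only, so char index equals byte index
            sb.append(bodies[i]);
        }
        return new LazyObjectStore(sb.toString().getBytes(StandardCharsets.ISO_8859_1), offsets);
    }

    public static void main(String[] args) {
        LazyObjectStore store = demo();
        System.out.println(store.get(3));
        System.out.println("objects parsed: " + store.parsedCount());
    }
}
```

In this sketch, asking for object 3 reads only that object; objects 1 and 2 stay
untouched until something requests them.  A real implementation would read from
the file with random access instead of holding all bytes in memory.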

Jeremy Villalobos, Ph.D
