pdfbox-users mailing list archives

From Bob Swanson <...@swansongrp.com>
Subject re: Text for eBook Readers
Date Sat, 26 Jan 2013 02:53:35 GMT
You wrote:

 >>  Hello,
 >>  since numerous free ebooks come only in PDF format I am looking
 >> for a method to transform them to text or html to make them
 >> readable on ebook readers that don't support PDF reflow.

I have struggled with the same issues, not just with free
ebooks but also with web page content and the like. The "free"
books from Project Gutenberg are often available in plain text,
and you can work from there. Books offered by Google Books are
sometimes available in plain text as well.

My current tool of choice is Calibre. It can read PDF and convert
to many formats. How well? It has problems. See below:

 >>  While in general ExtractText works sufficiently well for words,
 >> that doesn't hold for paragraphs. In text mode, ExtractText
 >> doesn't distinguish between the end of a line and a new
 >> paragraph (often indicated by indenting the first line of the
 >> text block), thus formatting is quite poor for text. The HTML
 >> output seems to distinguish but still suffers from embedding
 >> headers and footers into the text.
 >>  I don't have an immediate solution, but would prefer if the
 >> text output would insert a blank line at the places where the
 >> HTML output sets a paragraph tag, or a tab for the indentation
 >> at the beginning of a line. As for headers and footers, I
 >> could only imagine to set some parameter to ignore text outside
 >> of the standard type area.
 >>  Or are these unreasonable wishes?
 >>  Best
 >>  Thomas
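
The blank-line idea can at least be approximated as a post-processing
step on the text output. What follows is only a minimal sketch, not
anything PDFBox provides: it assumes the extracted text still carries
the leading indentation of each paragraph's first line (which an
extractor may well discard), and the function name mark_paragraphs and
the min_indent parameter are made up for illustration.

```python
def mark_paragraphs(text, min_indent=2):
    """Insert a blank line before every indented line.

    Assumption: a line indented by at least `min_indent` spaces or tabs
    starts a new paragraph. This only works if the extractor kept the
    indentation in its output.
    """
    result = []
    for raw in text.splitlines():
        line = raw.rstrip()
        # Count leading whitespace characters on this line.
        indent = len(line) - len(line.lstrip(" \t"))
        is_indented = bool(line) and indent >= min_indent
        # Add a separator only if the previous line is not already blank.
        if is_indented and result and result[-1] != "":
            result.append("")
        result.append(line)
    return "\n".join(result)


if __name__ == "__main__":
    sample = (
        "   First paragraph starts here\n"
        "and continues.\n"
        "   Second paragraph\n"
        "ends here."
    )
    print(mark_paragraphs(sample))
```

Headers and footers are not touched by this; they would need a
separate rule, such as the restriction to a page region suggested
in the quoted message.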

Calibre tries to reverse-engineer the PDF content, just
as many programmers and users of PDFBox try to do.
It is neither easy nor automatic, as many on this discussion
group will tell you. For instance, I try to follow
a publication that is issued in a 3-column newspaper format.
It is available in PDF, but frankly, I have found no reasonable
way to render it into a readable form from there. The 3-column
format is mostly unreadable on a phone or tablet, and no
extractor I have tried can turn it into a single text flow.

It really is better to start with a format other than PDF, but
Calibre is worth a try if you have not already tried it.

Hope this helps.
