pdfbox-users mailing list archives

From Thomas Fischer <fischer...@aon.at>
Subject Re: Text for Ebook Readers
Date Sun, 27 Jan 2013 14:14:08 GMT
Dear Eliot,

> So in short, it's not unreasonable but it's also not something that can be
> easily generalized. For a general solution you have to have some way to
> configure the details about the pages you're extracting text from: the
> header and footer boundaries, the number of columns, the writing system used
> (is it Hebrew or Arabic? Is it a top-to-bottom, right-to-left language?),
> and so on. 

Yes, I am aware of that, but that seems to be at least partially solved (see below).

> I implemented a pretty good paragraph recognizer some years ago using an
> earlier (but functionally equivalent for the purpose) version of PDFBox.
> Unfortunately, that code was proprietary and I no longer have access to it.
> But we were able to recognize paragraphs on pages in typical mass-market
> fiction books (we were doing conversion of PDFs to a proprietary e-reader
> format). We also had to recognize page breaks within paragraphs and do
> de-hyphenation.

This sounds like the kind of program I'd be looking for; a pity it's not available.

And Bob,

> I have struggled with the same issues, not just with free
> ebooks, but web page content, etc. The "free" books from Project
> Gutenberg are often available in plain text, and
> you can work from there. Sometimes, books offered
> by Google Books are available in plain text.

Yes, I'm aware of that, and they usually offer a pretty rich choice of formats.
What bothers me is that I can get e.g. the OpenAccess books from my own university (see http://www.univerlag.uni-goettingen.de/)
only as PDF (they say it's hard to produce the various formats). And these PDFs don't display
well on ebook readers.

> My current tool of choice is Calibre. It can read PDF and convert
> to many formats. How well? It has problems.

I tried Calibre, but wasn't really satisfied; that's why I came back to pdfbox.

My point is that pdfbox's HTML output is already fairly close to what I need.
In the HTML output, breaks between paragraphs are recognised and marked by </p><p>,
while line breaks are preserved as such, but not tagged. The same distinction (e.g. an empty
line between paragraphs) in the text output would already go a long way in the direction I'm looking for.
Thus the paragraph recognition problem seems to be essentially solved.
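
As a rough illustration of the transformation described above, here is a minimal sketch in plain Java (no pdfbox calls). It assumes the HTML output wraps each paragraph in <p>...</p> and leaves line breaks inside a paragraph as bare newlines, as described; the class and method names (ParagraphText, htmlToText) are made up for this example, and real output may need more careful handling of entities and inline tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdfbox: turns <p>-tagged HTML into plain
// text with one paragraph per block and an empty line between paragraphs.
class ParagraphText {
    // Matches one <p>...</p> element; DOTALL lets '.' span the untagged line breaks.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String htmlToText(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String body = m.group(1)
                    .replaceAll("<[^>]+>", "")  // drop inline tags such as <b> or <i>
                    .replaceAll("\\s+", " ")    // join the lines of one paragraph
                    .trim();
            if (!body.isEmpty()) {
                paragraphs.add(unescape(body));
            }
        }
        // An empty line marks each paragraph break in the text output.
        return String.join("\n\n", paragraphs);
    }

    // Only the most common entities are handled (assumption: enough for a sketch).
    private static String unescape(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&amp;", "&");
    }
}
```

Calling ParagraphText.htmlToText on the HTML of a whole document would return the text with each paragraph on a single line, which an ebook reader can then reflow freely.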

What I'm missing there is the distinction between page breaks that are also paragraph breaks
and those that are only line breaks. Otherwise I could fairly easily transform the HTML output
into the kind of text I am looking for, using some search and replace with regular expressions.
But I'm not sufficiently versed in either Java or the PDF format to know where I could modify
the program to handle that distinction. But probably someone else is…
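
One possible way to approximate the page-break distinction without touching pdfbox itself is a text-level heuristic applied after extraction. The sketch below is only that: it assumes the paragraphs of each page are already available as strings, and it guesses that a paragraph continues across the page break when the last paragraph on a page does not end in sentence-final punctuation or the next page starts with a lowercase letter. The class PageBreakJoiner and its methods are hypothetical names, and the rule will misjudge some cases (headings, abbreviations, quotations).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step, not a pdfbox API.
class PageBreakJoiner {
    // Heuristic guess: does the text at the page end run on into the next page?
    static boolean continuesAcrossPage(String lastOnPage, String firstOnNext) {
        String end = lastOnPage.trim();
        String start = firstOnNext.trim();
        if (end.isEmpty() || start.isEmpty()) {
            return false;
        }
        boolean endsSentence = ".!?".indexOf(end.charAt(end.length() - 1)) >= 0;
        return !endsSentence || Character.isLowerCase(start.charAt(0));
    }

    // Joins two fragments, removing a trailing hyphen (naive de-hyphenation).
    static String merge(String prev, String next) {
        if (prev.endsWith("-")) {
            return prev.substring(0, prev.length() - 1) + next;
        }
        return prev + " " + next;
    }

    // pages: one list of paragraph strings per page, in reading order.
    static List<String> joinPages(List<List<String>> pages) {
        List<String> out = new ArrayList<>();
        for (List<String> page : pages) {
            boolean atPageStart = true;
            for (String raw : page) {
                String para = raw.trim();
                if (para.isEmpty()) {
                    continue;
                }
                if (atPageStart && !out.isEmpty()
                        && continuesAcrossPage(out.get(out.size() - 1), para)) {
                    // Page break was only a line break: glue the two halves together.
                    out.add(merge(out.remove(out.size() - 1), para));
                } else {
                    out.add(para);
                }
                atPageStart = false;
            }
        }
        return out;
    }
}
```

The naive de-hyphenation also strips hyphens that belong to a compound word split at the page end, so a dictionary check would be needed for anything beyond a first approximation.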
