pdfbox-users mailing list archives

From Eliot Kimber <ekim...@rsicms.com>
Subject Re: Text for Ebook Readers
Date Wed, 23 Jan 2013 14:27:39 GMT
Paragraph recognition from PDF content is a hard problem. For a complete
solution you have to consider the placement geometry of each text string to
see what its visual relationship is to the other strings. There is nothing in PDF
that requires that the order of text in the PDF data stream bear any
relationship to the reading order. For example, I had a document where the
first character of the first paragraph was at the end of the page's data
stream because it was a drop cap that was the last thing created on that
page in the desktop publishing system that created it. For Hebrew and Arabic
the data stream order of the characters will be the reverse of reading
order, at least in all the samples I processed (we were converting Arabic
invoices into HTML as part of a data conversion workflow).

You can use clues, such as consistent horizontal position and consistent
vertical spacing, to recognize paragraph contents and paragraph breaks, but
it's challenging and ultimately depends on fuzzy heuristics. And it really
only works for text laid out in simple columns without sidebars, floated
areas, and so on. It also requires the ability to recognize page headers and
footers.
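
A minimal sketch of that kind of heuristic, in plain Java. The names
(ParagraphGrouper, Line, leftMargin, leading, tolerance) are hypothetical and
are not PDFBox API. It assumes a single column, lines already in reading
order, header and footer lines already removed, and y growing downward from
the top of the page. The coordinates could come from something like PDFBox's
TextPosition objects, but that step is left out here.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphGrouper {
    /** One line of text: left edge x and baseline y, with y growing downward (assumption). */
    record Line(String text, double x, double y) {}

    /**
     * Groups lines into paragraphs. A new paragraph starts when a line is
     * indented relative to leftMargin, or when the vertical gap to the
     * previous line is larger than the normal leading.
     */
    static List<String> group(List<Line> lines, double leftMargin,
                              double leading, double tolerance) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            boolean indented = line.x() - leftMargin > tolerance;
            boolean extraGap = previous != null
                    && (line.y() - previous.y()) - leading > tolerance;
            if (previous != null && (indented || extraGap)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');   // a line break inside a paragraph becomes a space
            }
            current.append(line.text().trim());
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

The tolerance parameter is where the fuzziness lives: too small and every
slightly ragged line starts a paragraph, too large and real indents are
missed.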

So in short, it's not unreasonable but it's also not something that can be
easily generalized. For a general solution you have to have some way to
configure the details about the pages you're extracting text from: the
header and footer boundaries, the number of columns, the writing system used
(is it Hebrew or Arabic? Is it a top-to-bottom, right-to-left language?),
and so on. 
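
As an illustration only, such per-document configuration could be captured
in a small value type. Everything below (PageProfile, TextRun, Direction and
the field names) is a hypothetical sketch, not something PDFBox provides. It
assumes y is measured downward from the top of the page.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical per-document settings for an extraction run. */
record PageProfile(double headerBottom, double footerTop,
                   int columnCount, Direction direction) {

    enum Direction { LEFT_TO_RIGHT, RIGHT_TO_LEFT, TOP_TO_BOTTOM }

    /** A positioned piece of text; y grows downward from the page top (assumption). */
    record TextRun(String text, double x, double y) {}

    /** Keeps only the runs that lie between the header and footer boundaries. */
    List<TextRun> bodyRuns(List<TextRun> runs) {
        return runs.stream()
                .filter(r -> r.y() > headerBottom && r.y() < footerTop)
                .collect(Collectors.toList());
    }
}
```

The column count and direction are carried along here but not used; they
would feed whatever ordering step runs after the header/footer filter.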

An alternative (and what we did for the Arabic docs) was to generate HTML that
uses absolute positioning to simply place the text on the HTML page in the
same relative location as defined in the PDF. This gives you an accurate
reproduction of the original page (to the degree HTML allows it) but doesn't
give you any reflowability.
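
A rough sketch of the absolute-positioning idea, not the code used for the
Arabic documents. AbsoluteHtml and TextRun are hypothetical names, and the
coordinates are assumed to be in points measured from the top-left corner of
the page (PDF's default origin is bottom-left, so a caller would flip y
first).

```java
import java.util.List;
import java.util.Locale;

final class AbsoluteHtml {
    /** A text run with its top-left position and font size, all in points (assumption). */
    record TextRun(String text, double x, double y, double fontSize) {}

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one page as a relatively positioned box holding absolutely positioned spans. */
    static String renderPage(List<TextRun> runs, double width, double height) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">%n",
                width, height));
        for (TextRun run : runs) {
            html.append(String.format(Locale.ROOT,
                    "  <span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "font-size:%.1fpt;white-space:pre\">%s</span>%n",
                    run.x(), run.y(), run.fontSize(), escape(run.text())));
        }
        html.append("</div>");
        return html.toString();
    }
}
```

Because every span carries its own coordinates, the order of the runs does
not matter for display, which is why this approach sidesteps the reading
order problem entirely.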

I implemented a pretty good paragraph recognizer some years ago using an
earlier (but functionally equivalent for the purpose) version of PDFBox.
Unfortunately, that code was proprietary and I no longer have access to it.
But we were able to recognize paragraphs on pages in typical mass-market
fiction books (we were doing conversion of PDFs to a proprietary e-reader
format). We also had to recognize page breaks within paragraphs and do
de-hyphenation.
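
A simple de-hyphenation pass might look like the sketch below. This is an
illustration, not the original proprietary code; Dehyphenator and joinLines
are hypothetical names. It assumes the input is the lines of one paragraph
in reading order, which may span a page break once headers and footers have
been removed.

```java
import java.util.List;

final class Dehyphenator {
    /**
     * Joins the lines of one paragraph. When a line ends in a letter followed
     * by a hyphen and the next line starts with a lowercase letter, the hyphen
     * is dropped and the two halves are joined; otherwise lines are joined
     * with a single space.
     */
    static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (out.length() == 0) {
                out.append(line);
            } else if (endsWithBrokenWord(out) && Character.isLowerCase(line.charAt(0))) {
                out.setLength(out.length() - 1);   // drop the line-end hyphen
                out.append(line);
            } else {
                out.append(' ').append(line);
            }
        }
        return out.toString();
    }

    /** True if the text ends with a letter followed by a hyphen. */
    private static boolean endsWithBrokenWord(CharSequence s) {
        int n = s.length();
        return n >= 2 && s.charAt(n - 1) == '-' && Character.isLetter(s.charAt(n - 2));
    }
}
```

This will wrongly merge genuine compounds that happen to break at their
hyphen ("well-" / "known" becomes "wellknown"); distinguishing those needs a
dictionary lookup or similar, which is not shown.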

If I remember the solution correctly, we essentially got the x/y position of each text
sequence and then ordered them by the reading order and then analyzed their
geometric positions to detect paragraphs, line breaks, and so on. This
included detecting column boundaries using a heuristic like "if the
horizontal distance between two lines is more than {likely column gap}
assume a column break".
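
One way to read that heuristic, as a hedged sketch rather than a
reconstruction of the original code: sort the lines by their left edge,
start a new column wherever the jump between neighboring left edges exceeds
the likely column gap, then read each column top to bottom. ColumnOrder,
Line and columnGap are hypothetical names; the layout is assumed to be
left-to-right columns with y growing downward.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ColumnOrder {
    /** One line of text: left edge x and baseline y, with y growing downward (assumption). */
    record Line(String text, double x, double y) {}

    /** Returns the lines column by column, each column sorted top to bottom. */
    static List<Line> readingOrder(List<Line> lines, double columnGap) {
        List<Line> byX = new ArrayList<>(lines);
        byX.sort(Comparator.comparingDouble(Line::x));

        // Split into columns wherever neighboring left edges are far apart.
        List<List<Line>> columns = new ArrayList<>();
        double previousX = 0;
        for (Line line : byX) {
            if (columns.isEmpty() || line.x() - previousX > columnGap) {
                columns.add(new ArrayList<>());
            }
            columns.get(columns.size() - 1).add(line);
            previousX = line.x();
        }

        // Within each column, reading order is simply top to bottom.
        List<Line> ordered = new ArrayList<>();
        for (List<Line> column : columns) {
            column.sort(Comparator.comparingDouble(Line::y));
            ordered.addAll(column);
        }
        return ordered;
    }
}
```

The columnGap value has to be larger than any paragraph indent and smaller
than the distance between column left edges, so it is exactly the sort of
per-document setting that needs to be configurable.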

Cheers,

Eliot

On 1/23/13 8:05 AM, "Thomas Fischer" <fischer.th@aon.at> wrote:

> Hello,
> 
> since numerous free ebooks come only in PDF format I am looking for a method
> to transform them to text or html to make them readable on ebook readers that
> don't support PDF reflow.
> 
> While in general ExtractText works sufficiently well for words, that doesn't
> hold for paragraphs. In text mode, ExtractText doesn't distinguish between the
> end of a line and a new paragraph (often indicated by indenting the first line
> of the text block), thus formatting is quite poor for text. The HTML output
> seems to distinguish but still suffers from embedding headers and footers into
> the text.
> 
> I don't have an immediate solution, but would prefer that the text output
> insert a blank line at the places where the HTML output sets a paragraph tag,
> or a tab for the indentation at the beginning of a line. As for headers and
> footers, I can only imagine setting some parameter to ignore text outside of
> the standard type area.
> 
> Or are these unreasonable wishes?
> 
> Best
> Thomas
> 
> 

-- 
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/

