pdfbox-users mailing list archives

From "Amir H. Jadidinejad" <amir.jad...@yahoo.com.INVALID>
Subject Re: How to find the position of a specific paragraph in the input PDF?
Date Tue, 05 Aug 2014 09:06:19 GMT
Dear Eliot,
I do appreciate your comprehensive response. It was really informative for me.
Thank you.
Amir



________________________________
 From: Eliot Kimber <ekimber@rsicms.com>
To: "users@pdfbox.apache.org" <users@pdfbox.apache.org> 
Sent: Tuesday, August 5, 2014 7:43 AM
Subject: Re: How to find the position of a specific paragraph in the input PDF?
 

Detecting paragraphs is a "hard problem": there is nothing inherent in the
PDF data that will reliably tell you where paragraph boundaries are. Some
PDF documents may have more reliable indicators than others, but unless
you're working with a very specific set of documents you can't depend on
it.

The only reasonably-complete solution is to analyze the x/y location of
each line of text and use a heuristic to guess at paragraph boundaries,
where the heuristic will depend on how the paragraphs are indicated in the
document at hand: extra vertical space, first line indent, etc.
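
As a rough illustration of that kind of heuristic, here is a minimal sketch in plain Java. It assumes you already have one record per text line, in reading order, with a left-edge x and a baseline y measured downward from the top of the page. The names ParagraphGrouper, Line, and group, and all of the thresholds, are hypothetical and would need tuning per document; none of this is a PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphGrouper {
    /** One extracted text line: x is the left edge, y the baseline measured downward (an assumption). */
    static final class Line {
        final double x, y;
        final String text;
        Line(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    /**
     * Groups lines (already in reading order) into paragraphs. A new paragraph starts when
     * the vertical gap to the previous line exceeds leading * gapFactor, or when the line
     * is indented by more than indentTolerance relative to leftMargin.
     */
    static List<List<Line>> group(List<Line> lines, double leading, double gapFactor,
                                  double leftMargin, double indentTolerance) {
        List<List<Line>> paragraphs = new ArrayList<>();
        List<Line> current = new ArrayList<>();
        Line previous = null;
        for (Line line : lines) {
            boolean bigGap = previous != null && (line.y - previous.y) > leading * gapFactor;
            boolean indented = (line.x - leftMargin) > indentTolerance;
            if (!current.isEmpty() && (bigGap || indented)) {
                paragraphs.add(current);
                current = new ArrayList<>();
            }
            current.add(line);
            previous = line;
        }
        if (!current.isEmpty()) {
            paragraphs.add(current);
        }
        return paragraphs;
    }
}
```

A document that marks paragraphs only with extra vertical space could pass a very large indentTolerance to disable the indent test, and a document that uses only first-line indents could pass a very large gapFactor.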

In the easy case, the characters of the text line will be contiguous in
the PDF data stream. In the hard case, the characters will not be
contiguous and you will need to use each string's x/y position to build up
a single line (PDFBox may have utilities for this, I don't know). You'll
need to again use heuristics to determine that a given character is or is
not within a line (for example, superscripts and subscripts will not have
the same Y origin as other characters in the same line, but they are
definitely part of the line).
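
The line-building step could be sketched along these lines, again in plain Java and under stated assumptions: each fragment carries an x and a baseline y (measured downward), and a fragment joins a line when its baseline is within yTolerance of the first fragment placed on that line. LineBuilder, Fragment, and buildLines are hypothetical names, and a real tool would probably compare against the line's dominant baseline rather than its topmost fragment.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineBuilder {
    /** A separately placed run of characters with its origin (x, baseline y). */
    static final class Fragment {
        final double x, y;
        final String text;
        Fragment(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    /**
     * Clusters fragments into lines regardless of their order in the data stream.
     * Fragments are sorted by y; each one joins the current line if its baseline is
     * within yTolerance of that line's first (topmost) fragment, so a slightly raised
     * superscript still lands on its line. Each line is then ordered left to right.
     */
    static List<String> buildLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> fresh = new ArrayList<>();
                fresh.add(f);
                lines.add(fresh);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

The tolerance has to be smaller than the line spacing but larger than the superscript or subscript shift, which is itself a per-document guess.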

Even then, if you have a multi-column document you have the further
challenge of detecting the column boundary--if you must use the x/y
positions of the characters to detect lines horizontally, you then have to
have some way of distinguishing a normal interword space from the gap
between columns. This may require configuring your tool with the
boundaries of each column ("zoning"). Likewise, you may need to define
zones to distinguish the headers and footers from the main body content.
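
Zoning can be as simple as a hand-configured list of rectangles that each fragment is tested against before any line building happens. The sketch below is plain Java with hypothetical names (PageZones, Zone, zoneOf) and made-up coordinates for a US Letter page with y measured downward; it only shows the idea.

```java
import java.util.List;

class PageZones {
    /** A named rectangle on the page; left/top are inclusive, right/bottom exclusive. */
    static final class Zone {
        final String name;
        final double left, top, right, bottom;
        Zone(String name, double left, double top, double right, double bottom) {
            this.name = name;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
        boolean contains(double x, double y) {
            return x >= left && x < right && y >= top && y < bottom;
        }
    }

    /** Returns the name of the first configured zone containing the point, or "unzoned". */
    static String zoneOf(List<Zone> zones, double x, double y) {
        for (Zone z : zones) {
            if (z.contains(x, y)) {
                return z.name;
            }
        }
        return "unzoned";
    }
}
```

Fragments are then grouped by zone first, so a gap between the left and right column is never mistaken for an interword space, and header or footer text never leaks into the body.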

If you just need to reproduce the visual look of the page, say in HTML,
then it's not so hard: you just treat each separately-placed sequence of
characters as an absolutely-positioned <div> with appropriate styling
applied (which you can get from the PDF data). But if you need to try to
reconstitute the logical structure of the document, that is much harder.

If your pages are regular pages of simple text, the problem isn't too
hard. But if you have things like figures and tables then the problem
becomes harder.

If you need to detect paragraphs that span page boundaries, then you have
the challenge of distinguishing a paragraph that happens to end at the
bottom of a page from one that does not.

So there cannot be a general "get all the paragraphs in PDF"
function--even if you have general code it must be tuned with the details
of a given document or set of documents.

I know this from work I did more than 10 years ago to convert PDFs of
published books into the format used by the Sony EReader product (it used
a proprietary XML language as input). I'm sure PDFBox has improved since
then (we used it as the basis for our tool), but PDF itself has not changed
materially, and certainly the tools that produce it are not necessarily
any better now than they were then.

We did pretty well with simple fiction books that had little or no content
except paragraphs, but it still required zoning and so forth.

In one document everything was coming out correctly except the first
character of the first paragraph in a chapter, which always ended up at
the end of the page.

I finally realized that that character was a drop cap and it
happened to be the last character in the data stream for the page, but
its X/Y position put it first in the reading order--the typesetting
system (probably Quark at that time) put the drop cap last in the data
because it was placed by the operator after all the other content was
placed.

While it's unlikely, you could have perverse PDFs where each character is
separately drawn and the characters occur in some random order different
from the reading order.

Cheers,

Eliot
-- 
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com







On 8/4/14, 10:20 PM, "Qingchao Kong" <kqingchao@gmail.com> wrote:

>Amir,
>Paragraphs are separated by "\n", so it sounds feasible to split the
>text by "\n". But the text extracted from the PDF seems to contain
>many "\n"s, which would make it impossible to extract paragraphs. I
>don't even think there is a way to do this using PDFBox.
>
>One possible solution would be constructing classifiers to
>discriminate the boundary between different paragraphs.
>
>I also suggest you get to know the subject of "topic boundary detection".
>
>Regards,
>
>On Mon, Aug 4, 2014 at 9:53 AM, Amir H. Jadidinejad
><amir.jadidi@yahoo.com.invalid> wrote:
>>
>>
>> I'm going to extract the content of a PDF file using the PDFBox library.
>>The content should be processed paragraph-by-paragraph and for each
>>paragraph, I need its position for follow-up processing. Using the
>>following code, I can extract the whole content of an input PDF:
>>
>> PDDocument doc = PDDocument.load(file);
>> PDFTextStripper stripper = new PDFTextStripper();
>> String txt = stripper.getText(doc);
>> doc.close();
>>
>> I have two problems:
>>
>>     1. I don't know how to extract the content paragraph by paragraph.
>>     2. I don't know how to store the position of a paragraph for
>>follow-up processing (for example, highlighting).
>>
>> Thanks.
>
>
>
>-- 
>Qingchao Kong
>
>Ph.D. Candidate
>State Key Laboratory of Management and Control for Complex Systems
>Institute of Automation, Chinese Academy of Sciences
>
>No. 95 Zhongguancun East Road
>Haidian District, Beijing 100190 China