pdfbox-users mailing list archives

From Qingchao Kong <kqingc...@gmail.com>
Subject Re: How to find the position of a specific paragraph in the input PDF?
Date Tue, 05 Aug 2014 02:20:45 GMT
Amir,
Paragraphs are normally separated by "\n", so splitting the extracted
text on "\n" sounds feasible. In practice, though, the text extracted
from the PDF seems to contain many extra "\n"s, which makes it very
hard to recover paragraphs that way. I don't even think there is a way
to do this using PDFBox.

One possible solution would be constructing classifiers to
discriminate the boundary between different paragraphs.
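
To make the idea concrete, here is a minimal sketch of a boundary rule. It is a hand-written heuristic standing in for a trained classifier, not anything PDFBox provides: a line is assumed to end a paragraph when it is blank, or when it ends with terminal punctuation and is clearly shorter than the longest line. The class name ParagraphSplitter, the 0.8 threshold, and the punctuation set are all assumptions made for illustration. The recorded positions are character offsets into the extracted string, not coordinates on the PDF page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups extracted lines into paragraphs with a simple rule.
class ParagraphSplitter {

    /** A paragraph plus its character offsets in the extracted string. */
    static final class Paragraph {
        final String text; // lines joined with single spaces
        final int start;   // inclusive offset of the paragraph's first line
        final int end;     // exclusive offset, end of the paragraph's last line

        Paragraph(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed rule: blank line, or terminal punctuation on a "short" line.
    static boolean endsParagraph(String line, int maxLen) {
        String t = line.trim();
        if (t.isEmpty()) {
            return true;
        }
        char last = t.charAt(t.length() - 1);
        boolean terminal = last == '.' || last == '?' || last == '!' || last == ':';
        return terminal && t.length() < 0.8 * maxLen;
    }

    static List<Paragraph> split(String txt) {
        String[] lines = txt.split("\n", -1);
        int maxLen = 0;
        for (String l : lines) {
            maxLen = Math.max(maxLen, l.trim().length());
        }

        List<Paragraph> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int offset = 0; // offset of the current line within txt
        int start = -1; // where the current paragraph began
        int end = -1;   // where its last non-blank line ended
        for (String line : lines) {
            String t = line.trim();
            if (!t.isEmpty()) {
                if (cur.length() == 0) {
                    start = offset;
                } else {
                    cur.append(' ');
                }
                cur.append(t);
                end = offset + line.length();
            }
            if (cur.length() > 0 && endsParagraph(line, maxLen)) {
                out.add(new Paragraph(cur.toString(), start, end));
                cur.setLength(0);
            }
            offset += line.length() + 1; // +1 for the '\n' removed by split
        }
        if (cur.length() > 0) {
            out.add(new Paragraph(cur.toString(), start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String txt = "This is the first line of a paragraph that wraps\n"
                + "onto a second line.\n"
                + "A second paragraph starts here and also wraps over\n"
                + "two lines.\n";
        for (Paragraph p : split(txt)) {
            System.out.println("[" + p.start + ", " + p.end + ") " + p.text);
        }
    }
}
```

A rule like this will misfire on headings, lists, and short final lines that lack punctuation, which is why a classifier trained on real documents is likely to do better.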

I also suggest you read up on the subject of "topic boundary detection".

Regards,

On Mon, Aug 4, 2014 at 9:53 AM, Amir H. Jadidinejad
<amir.jadidi@yahoo.com.invalid> wrote:
>
>
> I'm going to extract the content of a PDF file using the PDFBox library.
> The content should be processed paragraph by paragraph, and for each
> paragraph I need its position for follow-up processing. Using the
> following code, I can extract the whole content of an input PDF:
>
> PDDocument doc = PDDocument.load(file);
> PDFTextStripper stripper = new PDFTextStripper();
> String txt = stripper.getText(doc);
> doc.close();
>
> I have two problems:
>
>     1. I don't know how to extract the content paragraph by paragraph.
>     2. I don't know how to store the position of a paragraph for follow-up
>        processing (for example, highlighting).
>
> Thanks.



-- 
Qingchao Kong

Ph.D. Candidate
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation, Chinese Academy of Sciences

No. 95 Zhongguancun East Road
Haidian District, Beijing 100190 China
