Date: Tue, 5 Aug 2014 10:20:45 +0800
Message-ID: 
 <CANXphgrrwYS0LOY6U9E+y5srraki3fjctjOwuBacO0FQZ1G4JA@mail.gmail.com>
Subject: Re: How to find the position of a specific paragraph in the input
 PDF?
From: Qingchao Kong <kqingchao@gmail.com>
To: users@pdfbox.apache.org, "Amir H. Jadidinejad" <amir.jadidi@yahoo.com>

Amir,
Paragraphs are normally separated by "\n", so splitting the extracted
text on "\n" sounds feasible. In practice, though, the text extracted
from the PDF seems to contain many extra "\n"s (for example at line
ends), so a plain split will not recover the paragraphs. I don't even
think there is a direct way to do this using PDFBox.
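
To illustrate the splitting problem, here is a minimal sketch in plain
Java (no PDFBox calls). It assumes that paragraphs in the extracted text
happen to be separated by a blank line, which will not hold for every
PDF. The names ParagraphSplitter and Paragraph are hypothetical, and the
offsets are character positions in the extracted string, not
coordinates on the page.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split extracted text into paragraphs at blank lines (an assumption). */
final class ParagraphSplitter {

    /** A paragraph plus its [start, end) character range in the extracted text. */
    static final class Paragraph {
        final String text;
        final int start; // inclusive offset into the extracted string
        final int end;   // exclusive offset into the extracted string

        Paragraph(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static List<Paragraph> split(String extracted) {
        List<Paragraph> result = new ArrayList<>();
        int n = extracted.length();
        int pos = 0;
        int paraStart = -1; // -1 means "not inside a paragraph"
        int paraEnd = -1;
        while (pos < n) {
            int nl = extracted.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? n : nl;
            // Exclude a trailing '\r' so Windows line endings do not leak into offsets.
            int contentEnd = (lineEnd > pos && extracted.charAt(lineEnd - 1) == '\r')
                    ? lineEnd - 1 : lineEnd;
            boolean blank = extracted.substring(pos, contentEnd).trim().isEmpty();
            if (blank) {
                // A blank line closes the paragraph collected so far, if any.
                if (paraStart >= 0) {
                    result.add(make(extracted, paraStart, paraEnd));
                    paraStart = -1;
                }
            } else {
                if (paraStart < 0) {
                    paraStart = pos;
                }
                paraEnd = contentEnd;
            }
            pos = lineEnd + 1;
        }
        if (paraStart >= 0) {
            result.add(make(extracted, paraStart, paraEnd));
        }
        return result;
    }

    private static Paragraph make(String s, int start, int end) {
        // Join the wrapped lines with single spaces so the paragraph reads as one string.
        String text = s.substring(start, end).replaceAll("\\s*\\r?\\n\\s*", " ").trim();
        return new Paragraph(text, start, end);
    }
}
```

Calling ParagraphSplitter.split(txt) on the string returned by
stripper.getText(doc) would give one entry per blank-line-delimited
block. If the PDF has no blank lines between paragraphs, everything ends
up in a single entry, which is exactly the difficulty described above.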

One possible solution would be to train a classifier that detects the
boundaries between paragraphs.
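
As a much simpler stand-in for a trained classifier, a fixed rule on
line spacing can serve as a baseline. The sketch below is plain Java and
rests on several assumptions: the lines of one page and one column are
already available in reading order, each with a top y-coordinate
(increasing down the page) and a height, and a larger-than-usual
vertical gap marks a new paragraph. Line, GapParagraphGrouper and
gapFactor are hypothetical names. Obtaining the coordinates (for
instance from PDFBox's TextPosition objects) is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: rule-based paragraph grouping by vertical gap (not a trained classifier). */
final class GapParagraphGrouper {

    /** Hypothetical line record: text, top y-coordinate, and line height. */
    static final class Line {
        final String text;
        final double y;      // assumed to increase down the page
        final double height;

        Line(String text, double y, double height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Groups lines into paragraphs. A new paragraph starts when the distance
     * between consecutive line tops exceeds gapFactor times the previous
     * line's height. Lines must already be in reading order.
     */
    static List<List<Line>> group(List<Line> lines, double gapFactor) {
        List<List<Line>> paragraphs = new ArrayList<>();
        List<Line> current = new ArrayList<>();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null && (line.y - prev.y) > gapFactor * prev.height) {
                paragraphs.add(current);
                current = new ArrayList<>();
            }
            current.add(line);
            prev = line;
        }
        if (!current.isEmpty()) {
            paragraphs.add(current);
        }
        return paragraphs;
    }
}
```

Because every Line keeps its y and height, the position of a paragraph
can be taken from its first and last line. The threshold gapFactor would
need tuning per document; a classifier could replace the fixed rule with
learned features such as indentation or font changes.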

I also suggest reading up on the subject of "topic boundary detection".

Regards,

On Mon, Aug 4, 2014 at 9:53 AM, Amir H. Jadidinejad
<amir.jadidi@yahoo.com.invalid> wrote:
>
>
> I'm going to extract the content of a PDF file using PDFBox library.
> The content should be processed paragraph-by-paragraph and for each
> paragraph, I need its position for follow-up processing. Using the
> following code, I can extract the whole content of an input PDF:
>
> PDDocument doc = PDDocument.load(file);
> PDFTextStripper stripper = new PDFTextStripper();
> String txt = stripper.getText(doc);
> doc.close();
>
> I have two problems:
>
>     1. I don't know how to extract the content paragraph by paragraph.
>     2. I don't know how to store the position of a paragraph for
>        follow-up processing (for example highlighting and etc.)
>
> Thanks.


-- 
Qingchao Kong

Ph.D. Candidate
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation, Chinese Academy of Sciences

No. 95 Zhongguancun East Road
Haidian District, Beijing 100190 China