Subject: Re: How to find the position of a specific paragraph in the input PDF?
From: Qingchao Kong
To: users@pdfbox.apache.org, "Amir H. Jadidinejad"
Date: Tue, 5 Aug 2014 10:20:45 +0800

Amir,

Paragraphs are normally separated by "\n", so splitting the text on "\n" sounds feasible. However, the text extracted from the PDF seems to contain many additional "\n"s, which makes it impossible to recover the paragraphs with a plain split. I don't even think there is a way to do this using PDFBox.

One possible solution would be to train a classifier that detects the boundaries between paragraphs. I also suggest reading up on "topic boundary detection".

Regards,

On Mon, Aug 4, 2014 at 9:53 AM, Amir H. Jadidinejad wrote:
> I'm going to extract the content of a PDF file using the PDFBox library. The content should be processed paragraph by paragraph, and for each paragraph I need its position for follow-up processing. Using the following code, I can extract the whole content of an input PDF:
>
> PDDocument doc = PDDocument.load(file);
> PDFTextStripper stripper = new PDFTextStripper();
> String txt = stripper.getText(doc);
> doc.close();
>
> I have two problems:
>
> 1. I don't know how to extract the content paragraph by paragraph.
> 2. I don't know how to store the position of a paragraph for follow-up processing (for example, highlighting).
>
> Thanks.

-- 
Qingchao Kong
Ph.D. Candidate
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation, Chinese Academy of Sciences
No. 95 Zhongguancun East Road
Haidian District, Beijing 100190
China
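
A rough baseline for the paragraph question, before trying a classifier, is to treat blank lines in the extracted text as paragraph boundaries and record each paragraph's character offsets. The following is only a minimal sketch in plain Java and is not PDFBox API: the names ParagraphOffsets, Paragraph, and split are hypothetical, and it assumes the extracted string really does contain blank lines between paragraphs, which may not hold for a given PDF. The offsets are positions in the extracted string, not coordinates on the page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): splits extracted text into paragraphs. */
class ParagraphOffsets {

    /** One paragraph and its character range [start, end) in the extracted text. */
    static final class Paragraph {
        final int start;
        final int end;
        final String text;

        Paragraph(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * Assumption: a blank (whitespace-only) line marks a paragraph boundary.
     * Single "\n"s inside a paragraph are kept as ordinary line breaks.
     */
    static List<Paragraph> split(String txt) {
        List<Paragraph> result = new ArrayList<>();
        int paraStart = -1; // offset of the open paragraph's first character, -1 if none
        int paraEnd = -1;   // offset just past the open paragraph's last non-blank line
        int pos = 0;
        while (pos < txt.length()) {
            int nl = txt.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? txt.length() : nl;
            String line = txt.substring(pos, lineEnd);
            if (line.trim().isEmpty()) {
                // A blank line closes the paragraph that is currently open, if any.
                if (paraStart >= 0) {
                    result.add(new Paragraph(paraStart, paraEnd, txt.substring(paraStart, paraEnd)));
                    paraStart = -1;
                }
            } else {
                if (paraStart < 0) {
                    paraStart = pos; // first non-blank line opens a new paragraph
                }
                paraEnd = lineEnd;
            }
            pos = lineEnd + 1; // continue after the line break
        }
        // Close a paragraph that runs to the end of the text.
        if (paraStart >= 0) {
            result.add(new Paragraph(paraStart, paraEnd, txt.substring(paraStart, paraEnd)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by stripper.getText(doc).
        String txt = "First line\nsecond line\n\nNext paragraph\n";
        for (Paragraph p : split(txt)) {
            System.out.println(p.start + "-" + p.end + ": " + p.text.replace('\n', ' '));
        }
    }
}
```

The stored start and end offsets let later processing refer back to a paragraph within the extracted string. Mapping them to page positions for highlighting is a separate problem that this sketch does not address.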