Date: Thu, 17 Dec 2015 14:45:47 +0000 (UTC)
From: "Andreas Meier (JIRA)"
To: dev@pdfbox.apache.org
Reply-To: dev@pdfbox.apache.org
Subject: [jira] [Issue Comment Deleted] (PDFBOX-2998) Enhance the text extraction capabilities

     [ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Meier updated PDFBOX-2998:
----------------------------------
    Comment: was deleted

(was: I think it is the right place to comment. Writing an algorithm to detect definite blocks of text should not be that hard, but those algorithms might fail in more complex scenarios. If you can separate your code from the PDFTextStripper I will have a look at it.
)

> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: DropCapExample1.pdf, DropCapExample2.pdf, DropCapExample3.pdf, DropCapExample4.pdf, DropCapExample5.pdf, DropCapSegmentation.jpg, TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line, i.e. don't mix up text from different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org