From: Soeren Pekrul
Date: Thu, 22 Mar 2007 20:34:27 +0100
To: java-user@lucene.apache.org
Subject: Re: Extracting formatted text from PDF files

Mike O'Leary wrote:
> Please forgive the laziness inherent in this question, as I haven't looked
> through the PDFBox code yet. I am wondering if that code supports extracting
> text from PDF files while preserving such things as sequences of whitespace
> between characters and other layout and formatting information. I am working
> with a project that extracts and operates on certain table-like blocks of
> text from PDF files, and a lot of freeware and shareware PDF to text
> converters seem to either ignore formatting or try to preserve formatting
> and not get it quite right. I am wondering if PDFBox provides better support
> for this kind of thing. Thanks.

That is not so simple. Usually a PDF file does not contain this information. PDF is an output format: it contains little more than instructions such as "print the character 'a' at position x, y". In many cases a PDF file doesn't even know about words or white space. We read words, see paragraphs, and see tables only because of where the characters are positioned. The file itself doesn't contain this information.

I found this code in a PDF file for the German word "Wuchsform" (growth form) followed by a colon ":":

    /F1 1 Tf
    -3.8801 -1.274 TD
    [ (W) 29.60001 (uchsform:) ] TJ

First line: select the font F1 at size 1.
Second line: move to the start of the next line, offset by (-3.8801, -1.274) from the start of the current line.
Third line: print the character "W", shift the position by 29.60001 thousandths of a text-space unit (a positive value pulls the next glyph to the left, which is how kerning is expressed), and print the characters "uchsform:".

Extracting the words from a PDF file for indexing means you first have to build words from the character positions. Recognizing paragraphs, multi-column text, tables, captions, lists, footnotes, etc. is much more difficult.
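
To illustrate what "building words from the character positions" can look like, here is a minimal sketch in Python. It assumes the position and advance width of every glyph have already been recovered from the content stream. The Glyph record, the function name, and the min_gap and line_tolerance thresholds are hypothetical names and values chosen for this example; this is not PDFBox code and not how PDFBox does it.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph (assumed already in page units)
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # horizontal advance of the glyph


def group_into_words(glyphs, min_gap=1.5, line_tolerance=2.0):
    """Group positioned glyphs into words using only their coordinates.

    min_gap and line_tolerance are illustrative thresholds; real documents
    usually need values derived from the font size.
    """
    def line_key(g):
        # Glyphs whose baselines fall into the same bucket share a line.
        return round(g.y / line_tolerance)

    # Reading order: top line first, then left to right within a line.
    ordered = sorted(glyphs, key=lambda g: (-line_key(g), g.x))

    words, current, prev = [], [], None
    for g in ordered:
        if prev is not None:
            new_line = line_key(g) != line_key(prev)
            gap = g.x - (prev.x + prev.width)
            # A line change or a wide horizontal gap ends the current word.
            if new_line or gap > min_gap:
                words.append("".join(current))
                current = []
        current.append(g.char)
        prev = g
    if current:
        words.append("".join(current))
    return words


def lay_out(text, x0, y, width=5.0):
    """Helper for the example: place characters side by side without gaps."""
    return [Glyph(c, x0 + i * width, y, width) for i, c in enumerate(text)]


if __name__ == "__main__":
    page = lay_out("Wuchsform:", 0, 100) + lay_out("aufrecht", 53, 100)
    print(group_into_words(page))
```

Even this simple version has to guess two thresholds, and it knows nothing about columns, tables or rotated text, which is why the layout cases are so much harder.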

Sören