Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (herse.apache.org: domain of soeren.pekrul@gmx.de
 designates 213.165.64.20 as permitted sender)
Message-ID: <4602DA43.1070609@gmx.de>
Date: Thu, 22 Mar 2007 20:34:27 +0100
From: Soeren Pekrul <soeren.pekrul@gmx.de>
User-Agent: Mozilla Thunderbird 1.0.7 (Windows/20050923)
MIME-Version: 1.0
To: java-user@lucene.apache.org
Subject: Re: Extracting formatted text from PDF files
References: <000601c76cad$0e7f3c30$0302a8c0@xpsoleary>
In-Reply-To: <000601c76cad$0e7f3c30$0302a8c0@xpsoleary>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit

Mike O'Leary wrote:
> Please forgive the laziness inherent in this question, as I haven't looked
> through the PDFBox code yet. I am wondering if that code supports extracting
> text from PDF files while preserving such things as sequences of whitespace
> between characters and other layout and formatting information. I am working
> with a project that extracts and operates on certain table-like blocks of
> text from PDF files, and a lot of freeware and shareware PDF to text
> converters seem to either ignore formatting or try to preserve formatting
> and not get it quite right. I am wondering if PDFBox provides better support
> for this kind of thing. Thanks.

That is not so simple. A PDF file usually does not contain this
information. PDF is an output format: a page description contains little
more than instructions such as "print the character 'a' at position
(x, y)". In many cases a PDF file doesn't even know about words or white
space. We read words, see paragraphs, and recognize tables only because
of where the characters are positioned. The file itself doesn't contain
this structure.
I found this code in a PDF file for the German word "Wuchsform" (growth
form) followed by a colon ":":

/F1 1 Tf
-3.8801 -1.274 TD
[ (W) 29.60001 (uchsform:) ] TJ

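First line: Select the font F1 at size 1.
Second line: Move to the start of the next line, offset by
(-3.8801, -1.274) from the start of the current line.
Third line: Show the character "W", shift the next glyph by 29.60001
thousandths of a text space unit (a positive number moves it to the
left, tightening the spacing), then show the characters "uchsform:".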
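
To make the third line concrete, here is a minimal sketch in Java of how
one could read such a TJ array. The class and method names are made up
for illustration and are not PDFBox API. It assumes simple literal
strings without escapes or nested parentheses, and the default
horizontal scaling. In the PDF text model a number inside a TJ array is
given in thousandths of a text space unit and is subtracted from the
current position, which is why the shift comes out negative.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: reads the operand array of a TJ operator. */
final class TjArraySketch {

    /** The concatenated text plus the numeric adjustments found between strings. */
    static final class Result {
        final String text;
        final List<Double> adjustments;

        Result(String text, List<Double> adjustments) {
            this.text = text;
            this.adjustments = adjustments;
        }
    }

    /**
     * Parses something like "[ (W) 29.60001 (uchsform:) ]".
     * Assumption: only simple literal strings, no escapes, no nested parentheses.
     */
    static Result parse(String array) {
        StringBuilder text = new StringBuilder();
        List<Double> adjustments = new ArrayList<>();
        int i = 0;
        while (i < array.length()) {
            char c = array.charAt(i);
            if (c == '(') {
                // Literal string: copy everything up to the closing parenthesis.
                int end = array.indexOf(')', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated string");
                }
                text.append(array, i + 1, end);
                i = end + 1;
            } else if (c == '-' || c == '+' || c == '.' || Character.isDigit(c)) {
                // Number: a position adjustment between two strings.
                int start = i;
                i++;
                while (i < array.length()
                        && (array.charAt(i) == '.' || Character.isDigit(array.charAt(i)))) {
                    i++;
                }
                adjustments.add(Double.parseDouble(array.substring(start, i)));
            } else {
                i++; // brackets and whitespace
            }
        }
        return new Result(text.toString(), adjustments);
    }

    /** Horizontal shift in text space units; a negative result means "to the left". */
    static double shiftInTextSpace(double adjustment, double fontSize) {
        return -adjustment / 1000.0 * fontSize;
    }
}
```

For the array above, parse(...) gives the text "Wuchsform:" and one
adjustment of 29.60001, so the word only appears once the pieces are
joined. At font size 1 the shift is about 0.03 units to the left, which
is kerning and not a space.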

Extracting the words from a PDF file for indexing means you first have
to build words from the character positions. Recognizing paragraphs,
multi-column text, tables, captions, lists, footnotes etc. is much more
difficult.
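
As a rough sketch of the first step, building words from character
positions on a single line could look like the Java below. The Glyph
class, the buildWords method and the gap threshold are assumptions of
mine for illustration, not how PDFBox does it. The sketch only looks at
horizontal gaps and ignores rotation, font size changes and line
detection.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: groups positioned characters of one text line into words. */
final class WordBuilderSketch {

    /** Hypothetical glyph: a character, the x of its left edge, and its width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Starts a new word whenever the gap between the right edge of the previous
     * glyph and the left edge of the next one is larger than maxGap.
     */
    static List<String> buildWords(List<Glyph> glyphs, double maxGap) {
        // Glyphs need not be stored in reading order, so sort by x first.
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRight = 0.0;
        for (Glyph g : sorted) {
            if (current.length() > 0 && g.x - previousRight > maxGap) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.ch);
            previousRight = g.x + g.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

Choosing maxGap is the hard part: too small and kerned words fall apart,
too large and neighbouring words or table cells merge. A threshold
relative to the font size or to the width of a space in the current
font is probably a better starting point than a fixed number.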

Sören
