pdfbox-dev mailing list archives

From "Orbel Mkrtchyan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-2053) Issue with PDFBox position reading
Date Fri, 02 May 2014 10:07:15 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Orbel Mkrtchyan updated PDFBOX-2053:
------------------------------------

    Attachment: test.pdf

> Issue with PDFBox position reading
> ----------------------------------
>
>                 Key: PDFBOX-2053
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2053
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.3
>            Reporter: Orbel Mkrtchyan
>         Attachments: test.pdf
>
>
> Using PDFBox 1.8.4,
> bug #1:
> 		PDDocument doc = new PDDocument();
> 		doc.load("test-pcc7247.pdf");
> 		doc.save("out.pdf");
> 		doc.close();
> The resulting file is corrupted, contains 0 pages and cannot be viewed by Acrobat Reader.
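
A possible explanation for bug #1, offered as an assumption rather than a confirmed diagnosis: in the PDFBox 1.8.x API, PDDocument.load(String) is a static factory method that returns a new document. Calling it through the doc instance would therefore leave the empty new PDDocument() untouched, and save() would write a file with 0 pages. The sketch below reproduces that pattern in plain Java with a hypothetical Doc class (a stand-in for illustration, not PDFBox itself).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document class; NOT the PDFBox API.
class Doc {
    private final List<String> pages = new ArrayList<>();

    // Static factory: builds and returns a NEW document.
    // It never modifies an instance it happens to be called through.
    static Doc load(String fileName) {
        Doc loaded = new Doc();
        loaded.pages.add("page 1 of " + fileName);
        return loaded;
    }

    int pageCount() {
        return pages.size();
    }
}

class StaticLoadPitfall {
    // Mirrors the pattern in the report: the return value of load() is dropped.
    static int pagesWhenResultDiscarded() {
        Doc doc = new Doc();
        doc.load("test-pcc7247.pdf"); // compiles, but resolves to Doc.load(...); result is discarded
        return doc.pageCount();
    }

    // Keeps the document that the static factory returns.
    static int pagesWhenResultKept() {
        Doc doc = Doc.load("test-pcc7247.pdf");
        return doc.pageCount();
    }

    public static void main(String[] args) {
        System.out.println(pagesWhenResultDiscarded()); // prints 0
        System.out.println(pagesWhenResultKept());      // prints 1
    }
}
```

If this is what happens with the reported snippet, assigning the result, as in PDDocument doc = PDDocument.load("test-pcc7247.pdf");, would be the usual form in 1.8.x.
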
> bug #2: consider the following code snippet. The code runs like this:
>       Extractor extractor = new Extractor();
>       extractor.writeText(pdDoc, output);
> Using the code defined like this:
> public class Extractor extends PDFTextStripper {
> ...
>     protected void writePage() throws IOException
>     {
>         for( int i = 0; i < charactersByArticle.size(); i++)
>         {
>             List<TextPosition> textList = charactersByArticle.get( i );
>             Iterator textIter = textList.iterator();
>             while( textIter.hasNext() )
>             {
>                 TextPosition position = (TextPosition)textIter.next();
> In this code, the position variable correctly iterates through the letters of the
> first line of the provided PDF document, but its coordinates (x, y, widths, etc.)
> are always the same. To be clear, each position corresponds to exactly one letter,
> and its widths array always has length 1, so every letter in a line gets the same
> coordinates. The expected behaviour is either new coordinates per letter, or a
> widths[] array that contains the widths of all the characters in the whole line of text.
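
To make the second alternative in the expected behaviour concrete: if widths[] did hold one width per character of the line, a per-letter x coordinate could be derived by accumulating those widths from the x at which the line starts. The sketch below shows only that arithmetic. GlyphOffsets and leftEdges are hypothetical names, not part of PDFBox, and the sketch assumes horizontal left-to-right text with no extra character or word spacing.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: derives a left x coordinate per letter
// from the x where a run of text starts and the advance width of each letter.
class GlyphOffsets {
    static float[] leftEdges(float startX, float[] widths) {
        float[] xs = new float[widths.length];
        float x = startX;
        for (int i = 0; i < widths.length; i++) {
            xs[i] = x;       // letter i starts where the previous letter ended
            x += widths[i];  // advance by this letter's width
        }
        return xs;
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the attached PDF.
        float[] xs = leftEdges(72f, new float[] {5f, 3f, 6f});
        System.out.println(Arrays.toString(xs)); // prints [72.0, 77.0, 80.0]
    }
}
```
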



--
This message was sent by Atlassian JIRA
(v6.2#6252)
