pdfbox-dev mailing list archives

From "Orbel Mkrtchyan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (PDFBOX-2053) Issue with PDFBox position reading
Date Fri, 02 May 2014 09:39:14 GMT
Orbel Mkrtchyan created PDFBOX-2053:

             Summary: Issue with PDFBox position reading
                 Key: PDFBOX-2053
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2053
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.8.3
            Reporter: Orbel Mkrtchyan

Using PDFBox 1.8.4,
bug #1:
		PDDocument doc = new PDDocument();

The resulting file is corrupted, contains 0 pages and cannot be viewed by Acrobat Reader.

bug #2: consider the following code snippet. The code runs like this:
      Extractor extractor = new Extractor();
      extractor.writeText(pdDoc, output);

where Extractor is defined like this:

public class Extractor extends PDFTextStripper {
    @Override
    protected void writePage() throws IOException {
        // Walk every article, then every text position within it.
        for( int i = 0; i < charactersByArticle.size(); i++) {
            List<TextPosition> textList = charactersByArticle.get( i );
            Iterator<TextPosition> textIter = textList.iterator();
            while( textIter.hasNext() ) {
                TextPosition position = textIter.next();
                // inspect position here
            }
        }
    }
}

In the code above, the position variable correctly iterates through the letters of the
first line of the provided PDF document, but its coordinates (x, y, widths, etc.) are the
same on every iteration. To be clear, each position corresponds to exactly one letter, and
the length of its widths array is always 1, so every letter in a line is reported with the
same coordinates. The expected behaviour is either that each letter gets its own
coordinates, or that widths[] contains the widths of all the characters in the line.
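
To illustrate the second expected behaviour, here is a minimal, self-contained sketch of how per-letter x coordinates could be derived if a position carried the line's start x and a widths array covering the whole line. It does not use PDFBox; the class GlyphOffsets and the method startXs are hypothetical names introduced only for this illustration, and the sketch assumes left-to-right text with no extra character or word spacing.

```java
// Hypothetical helper, not part of PDFBox: derives the start x of each
// letter from the line's start x and the per-letter widths.
class GlyphOffsets {

    // Returns one x coordinate per entry in widths. The first letter starts
    // at lineStartX; each following letter starts where the previous ended.
    static float[] startXs(float lineStartX, float[] widths) {
        float[] xs = new float[widths.length];
        float x = lineStartX;
        for (int i = 0; i < widths.length; i++) {
            xs[i] = x;
            x += widths[i];
        }
        return xs;
    }

    public static void main(String[] args) {
        // Illustrative values only; real widths would come from the font.
        float[] xs = startXs(10f, new float[] { 5f, 3f, 4f });
        for (int i = 0; i < xs.length; i++) {
            System.out.println("letter " + i + " starts at x=" + xs[i]);
        }
    }
}
```

With the behaviour reported above (a widths array of length 1 and identical x for every letter), no such derivation is possible, which is why the letters cannot be told apart by position.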

