pdfbox-users mailing list archives

From Andrew Phillips <andyphillips...@gmail.com>
Subject Issue with PDFTextStripper.java and text positions
Date Fri, 06 Dec 2013 22:10:28 GMT
While working with the PDFTextStripper class, I found a bug in the code.  I’d love to contribute
the fix, but I'm not sure of the best way to do that.   I am an experienced programmer, but I have
never contributed to an open source project (yet, although I should, considering how much I take
advantage of them).

While pulling text from a PDF with a custom PDFTextStripper subclass that overrides
writeString(String text, List<TextPosition> textPositions), I was getting textPositions that
did not line up with the text.   The text positions for every “word” in a line always came
through as the text positions of the last word in that line.   I traced the issue to the
PDFTextStripper class.

Here is the code with the issue:

    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator) 
        {
            normalized.add(createWord(lineBuilder.toString(), wordPositions));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else 
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }


When normalizeAdd reaches a word separator, it creates a new word and passes it wordPositions.
The new WordWithTextPositions in the normalized linked list stores a reference to that list,
not a copy, but the very next statement calls clear() on it.   Because the list was passed by
reference, clearing it also empties the positions held by the WordWithTextPositions that was
just created, and the same list then keeps being reused for the words that follow.
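
To make the aliasing concrete, here is a minimal, self-contained sketch. It does not use PDFBox: the Word class and the integer "positions" are hypothetical stand-ins for WordWithTextPositions and TextPosition, assumed only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class AliasingDemo {
    // Hypothetical stand-in for WordWithTextPositions: it stores the list it is handed, not a copy.
    static class Word {
        final String text;
        final List<Integer> positions;

        Word(String text, List<Integer> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    // Repeats the add-then-clear sequence and returns what the first word ends up holding.
    static List<Integer> firstWordPositions() {
        List<Integer> wordPositions = new ArrayList<Integer>();
        wordPositions.add(10);
        wordPositions.add(20);

        Word first = new Word("ab", wordPositions);
        wordPositions.clear();   // empties first.positions too: both names refer to one list
        wordPositions.add(30);   // positions of the next word land in the same list

        return first.positions;
    }

    public static void main(String[] args) {
        System.out.println(firstWordPositions()); // prints [30]
    }
}
```

The first word ends up reporting the positions of the word that came after it, which matches the symptom I was seeing.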

So, I would suggest the following:

    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator) 
        {
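            // Give the word its own copy of the positions. Assigning a new list to the
            // wordPositions parameter would not be seen by the caller, which keeps
            // passing in its original list.
            normalized.add(createWord(lineBuilder.toString(),
                    new ArrayList<TextPosition>(wordPositions)));
            lineBuilder = new StringBuilder();
            wordPositions.clear();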
        }
        else 
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }


This will fix the issue.   I would be more than happy to add this, but as I mentioned, I am
not really experienced in contributing to open source projects.
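
One thing worth checking in any fix of this kind: Java passes object references by value, so the caller's list is affected only by calls made on it, never by assigning a new list to the parameter inside the method. Below is a minimal sketch of a hand-off that copies the list, so the caller can keep clearing and reusing its own. The names (Word, addWord, integer positions) are hypothetical stand-ins, not the real PDFBox classes.

```java
import java.util.ArrayList;
import java.util.List;

class CopyOnHandOffDemo {
    // Hypothetical stand-in for WordWithTextPositions.
    static class Word {
        final String text;
        final List<Integer> positions;

        Word(String text, List<Integer> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    // The word receives its own copy, so clearing the caller's list cannot touch it.
    static void addWord(List<Word> normalized, String text, List<Integer> wordPositions) {
        normalized.add(new Word(text, new ArrayList<Integer>(wordPositions)));
        wordPositions.clear();
    }

    // Builds two words from one reused positions list, the way a caller loop would.
    static List<Word> twoWords() {
        List<Word> normalized = new ArrayList<Word>();
        List<Integer> wordPositions = new ArrayList<Integer>();

        wordPositions.add(10);
        wordPositions.add(20);
        addWord(normalized, "ab", wordPositions);

        wordPositions.add(30);
        addWord(normalized, "c", wordPositions);
        return normalized;
    }

    public static void main(String[] args) {
        for (Word w : twoWords()) {
            System.out.println(w.text + " " + w.positions);
        }
    }
}
```

Each word keeps exactly the positions that were collected for it, at the cost of one small list copy per word.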

Thanks!
Andy Phillips