pdfbox-users mailing list archives

From Andreas Lehmkühler <andr...@lehmi.de>
Subject Re: Issue with PDFTextStripper.java and text positions
Date Mon, 09 Dec 2013 11:01:41 GMT
Hi,

> On 6 December 2013 at 23:05, Andrew Phillips <andyphillips404@gmail.com>
> wrote:
>
>
> Working with the PDFTextStripper class, I found a bug in the code. I'd love
> to contribute the fix, but I'm not sure of the best way to do that. I am an
> experienced programmer, but have never contributed to an open source project
> (yet, although I should, considering how much I benefit from them).

Thanks for your interest in PDFBox and your offer to help. We use JIRA [1] to
track all changes, such as issues, improvements etc. You have to create a user
account (it's free) and create an issue. Choose a reasonable title, add a
description and attach a sample PDF if possible. Patches should be created as a
diff against the current trunk and attached to the issue as well. That's it.

>
> While pulling text from a PDF with a custom PDFTextStripper subclass that
> overrides writeString(String text, List<TextPosition> textPositions), I
> found that I was getting the wrong textPositions: they did not line up with
> the text. The text positions of every “word” in a line always come back as
> the text positions of the last word in the line. I tracked the issue down
> to the PDFTextStripper class.
>
> Here is the code in question:
>
>     /**
>      * Used within {@link #normalize(List, boolean, boolean)} to handle a
>      * {@link TextPosition}.
>      * @return The StringBuilder that must be used when calling this method.
>      */
>     private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>             StringBuilder lineBuilder, List<TextPosition> wordPositions,
>             TextPosition text)
>     {
>         if (text instanceof WordSeparator)
>         {
>             normalized.add(createWord(lineBuilder.toString(), wordPositions));
>             lineBuilder = new StringBuilder();
>             wordPositions.clear();
>         }
>         else
>         {
>             lineBuilder.append(text.getCharacter());
>             wordPositions.add(text);
>         }
>         return lineBuilder;
>     }
>
>
> In the normalizeAdd method, you create a new word, passing it
> wordPositions. A reference to wordPositions is stored in the new
> WordWithTextPositions in the normalized linked list, but two lines later
> you call clear() on it. Since the list was passed by reference, the
> positions held by the WordWithTextPositions you just created are cleared
> as well.
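
To illustrate the aliasing described above, here is a minimal, self-contained sketch. It is not PDFBox code: Word is a hypothetical stand-in for WordWithTextPositions, and plain strings stand in for TextPosition objects. The only assumption is that the word object keeps the list reference it is given rather than copying it.

```java
import java.util.ArrayList;
import java.util.List;

class AliasingDemo {
    /** Hypothetical stand-in for WordWithTextPositions: stores the list it is given. */
    static final class Word {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions; // no copy: shares the caller's list object
        }
    }

    /** Builds a word from a list, clears the list, and reports what the word still holds. */
    static int positionsLeftAfterClear() {
        List<String> wordPositions = new ArrayList<>();
        wordPositions.add("p0");
        wordPositions.add("p1");

        Word word = new Word("ab", wordPositions);
        wordPositions.clear(); // also empties word.positions, since it is the same object

        return word.positions.size();
    }

    public static void main(String[] args) {
        System.out.println(positionsLeftAfterClear());
    }
}
```

The word and the caller hold the same list object, so clearing it through one reference empties it for both.
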
>
> So, I would suggest the following:
>
>     /**
>      * Used within {@link #normalize(List, boolean, boolean)} to handle a
>      * {@link TextPosition}.
>      * @return The StringBuilder that must be used when calling this method.
>      */
>     private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>             StringBuilder lineBuilder, List<TextPosition> wordPositions,
>             TextPosition text)
>     {
>         if (text instanceof WordSeparator)
>         {
>             normalized.add(createWord(lineBuilder.toString(), wordPositions));
>             lineBuilder = new StringBuilder();
>             wordPositions = new ArrayList<TextPosition>();
>         }
>         else
>         {
>             lineBuilder.append(text.getCharacter());
>             wordPositions.add(text);
>         }
>         return lineBuilder;
>     }
>
>
> This will fix the issue.   I would be more than happy to add this, but as I
> mentioned, I am not really experienced in contributing to open source
> projects.

Sounds reasonable!
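
One caveat worth checking when preparing the patch: Java passes references by value, so assigning a new list to the wordPositions parameter only changes the local variable inside normalizeAdd. If the caller keeps passing the same list object on every call (an assumption, since the calling code is not shown here), later positions would still be appended to the list held by the previous word. A variant that sidesteps this is to hand the word its own copy and keep clearing the shared list. The sketch below shows that idea with hypothetical stand-ins (Word for WordWithTextPositions, strings for TextPosition, a space for WordSeparator); it is not the PDFBox source.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class DefensiveCopyDemo {
    /** Hypothetical stand-in for WordWithTextPositions. */
    static final class Word {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    /** Same shape as normalizeAdd, but each word receives its own copy of the positions. */
    static StringBuilder normalizeAdd(LinkedList<Word> normalized, StringBuilder lineBuilder,
            List<String> wordPositions, String text) {
        if (text.equals(" ")) { // a space plays the role of WordSeparator here
            normalized.add(new Word(lineBuilder.toString(), new ArrayList<>(wordPositions)));
            lineBuilder = new StringBuilder();
            wordPositions.clear(); // safe now: the word holds a separate list
        } else {
            lineBuilder.append(text);
            wordPositions.add(text);
        }
        return lineBuilder;
    }

    /** Feeds a line through normalizeAdd character by character, reusing one list. */
    static LinkedList<Word> split(String line) {
        LinkedList<Word> normalized = new LinkedList<>();
        StringBuilder lineBuilder = new StringBuilder();
        List<String> wordPositions = new ArrayList<>();
        for (char c : line.toCharArray()) {
            lineBuilder = normalizeAdd(normalized, lineBuilder, wordPositions, String.valueOf(c));
        }
        return normalized;
    }
}
```

With the copy in place, each word keeps exactly the positions that were collected for it, regardless of how the caller reuses its list.
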

> Thanks!
> Andy Phillips

BR
Andreas Lehmkühler


[1] https://issues.apache.org/jira/browse/PDFBOX
