pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] Resolved: (PDFBOX-582) Ignoring text over images
Date Mon, 05 Apr 2010 17:37:27 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-582.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2.0

I've attached the final result (PDFBOX582-pg_00051.png). The ImageIO library, which isn't
part of PDFBox, is needed to render the embedded TIFF.
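
Whether a TIFF reader is available depends on the Java runtime and on which
ImageIO plugins are on the classpath. The following is a minimal sketch for
checking this, assuming only the standard javax.imageio API; the class name
TiffSupportCheck and the helper hasReader are hypothetical and not part of
PDFBox.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import java.util.Iterator;

public class TiffSupportCheck {

    /** Returns true if at least one ImageIO reader is registered for the format name. */
    public static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Re-scan the classpath so that plugin jars added at runtime are registered.
        ImageIO.scanForPlugins();
        System.out.println("TIFF reader available: " + hasReader("tiff"));
    }
}
```

If this reports false, the embedded TIFF presumably cannot be decoded until a
suitable ImageIO plugin is added to the classpath.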

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>             Fix For: 1.2.0
>
>         Attachments: PageDrawer.patch, PDFBOX582-pg_00051.png, pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (from 2000 and earlier) in scanned
> form. Sometimes, however, they appear to have run OCR and added the recovered text as
> an overlay, giving the end user a "native PDF" feel in the sense that text can be
> copied and pasted.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, Foxit Reader
> 3.1, iText 2.1) in that it tries to render both the image part and the textual overlay part,
> which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the image part
> and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the image part
> and work on the text part.
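
The two cases in the issue description can be illustrated with a small,
self-contained sketch. It does not use PDFBox and is not how PageDrawer or
PDFTextStripper are implemented; the types OverlayPolicy, Item and Kind are
hypothetical and only model a page as a list of image and text items.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlayPolicy {

    public enum Kind { IMAGE, TEXT }

    /** One content item on a page: a scanned image or a piece of OCR text. */
    public static final class Item {
        final Kind kind;
        final String value;

        public Item(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    /** Page rendering: keep the image items and skip the OCR text overlay. */
    public static List<String> itemsToRender(List<Item> page) {
        List<String> out = new ArrayList<>();
        for (Item item : page) {
            if (item.kind == Kind.IMAGE) {
                out.add(item.value);
            }
        }
        return out;
    }

    /** Text extraction: keep the text items, joined by spaces, and skip images. */
    public static String extractText(List<Item> page) {
        StringBuilder sb = new StringBuilder();
        for (Item item : page) {
            if (item.kind == Kind.TEXT) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(item.value);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Item> page = new ArrayList<>();
        page.add(new Item(Kind.IMAGE, "scan.tiff"));
        page.add(new Item(Kind.TEXT, "Recovered"));
        page.add(new Item(Kind.TEXT, "text"));
        System.out.println(itemsToRender(page));
        System.out.println(extractText(page));
    }
}
```

The point of the sketch is only that both consumers read the same page content
and each filters out the part it should ignore.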

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

