pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-1502) Not Extracting Text from PDF Document
Date Thu, 02 Jan 2014 18:05:52 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860415#comment-13860415 ]

Andreas Lehmkühler commented on PDFBOX-1502:
--------------------------------------------

OK, let me try to summarize all prior comments:

- PDFBox extracts all text of a PDF (where possible), excluding form values, annotations, metadata etc.
- Have a look at the [PrintFields|http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/fdf/PrintFields.java] example to see how to extract form values.
- Updated PDFs, like the edited one, have to be read using the non-sequential parser (use -nonSeq as a command-line option, or use PDDocument#loadNonSeq instead of PDDocument#load within your own code), as the old parser can't handle incremental updates.

If there are any further questions, please address them to our [mailing lists|http://pdfbox.apache.org/mailinglists.html].
We don't use JIRA as a Q&A tool.



> Not Extracting Text from PDF Document
> -------------------------------------
>
>                 Key: PDFBOX-1502
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1502
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator, 1.7.1, 1.8.0
>         Environment: Mac OS , jdk 1.7
>            Reporter: deepak
>            Assignee: Andreas Lehmkühler
>         Attachments: PDFBOX1502-RenewalAdvice.txt, Renewal Advice .pdf, Renewal_Advice_Edited.pdf, Renewal_Advice_Edited_Extracted_Text.txt
>
>
> PDDocument document = PDDocument.load(Inputstream);
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.getText(document) is not returning some of the text content in the attached PDF document. It is just returning the form fields, but the values are empty. The bug is reproducible in both the 1.8.0-SNAPSHOT and 1.7.1 codebases.
> Please help in resolving the issue.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
