tika-dev mailing list archives

From "Michael Graessle (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-973) PDF form data isn't included in extracted content.
Date Thu, 09 Aug 2012 19:54:19 GMT
Michael Graessle created TIKA-973:
-------------------------------------

             Summary: PDF form data isn't included in extracted content.
                 Key: TIKA-973
                 URL: https://issues.apache.org/jira/browse/TIKA-973
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 1.2
            Reporter: Michael Graessle
            Priority: Minor


When Tika extracts content from a PDF, the values of PDF form (AcroForm) fields are not
included in the extracted text.

The following code extracts this data directly via PDFBox, but it seems like something Tika
should be doing itself.

// "load" is presumably the already-loaded PDDocument, and "documentContent"
// the buffer (e.g. a StringBuilder) that collects the extracted text.
PDDocumentCatalog docCatalog = load.getDocumentCatalog();
if (docCatalog != null) {
  PDAcroForm acroForm = docCatalog.getAcroForm();
  if (acroForm != null) {
    @SuppressWarnings("unchecked")
    List<PDField> fields = acroForm.getFields();
    if (fields != null && !fields.isEmpty()) {
      documentContent.append(" ");
      // Append each non-null field value, separated by spaces.
      for (PDField field : fields) {
        if (field.getValue() != null) {
          documentContent.append(field.getValue());
          documentContent.append(" ");
        }
      }
    }
  }
}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
