pdfbox-dev mailing list archives

From "Brian Carrier (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (PDFBOX-434) Improve html output
Date Wed, 25 Feb 2009 16:55:03 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-434.
----------------------------------

    Resolution: Fixed

Checked in a slight variation of this patch.  The original patch would have failed for console
output.

Note that this patch renames the PDFTextStripper.beginParagraph() and PDFTextStripper.endParagraph()
methods to PDFTextStripper.beginArticle() and PDFTextStripper.endArticle(), which are more
accurate names: PDFBox currently has no way to detect paragraph boundaries, and these methods
are called at the beginning and end of each column on each page.
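
For readers unfamiliar with the hook pattern described above, here is a minimal
sketch of how begin/end article callbacks can bracket each column of text. It
does not use PDFBox: the classes ArticleHooks, Stripper and HtmlStripper, the
write() method, and the choice of wrapping each article in a <div> are all
hypothetical and exist only for illustration.

```java
import java.util.List;

// Hypothetical stand-in for the hook pattern; this is NOT the PDFBox API.
class ArticleHooks {

    // Base writer: emits plain text and calls the hooks around each article.
    static class Stripper {
        protected final StringBuilder out = new StringBuilder();

        // Called once before and once after each article (column of text).
        protected void beginArticle() { }
        protected void endArticle() { }

        // Each inner list holds the lines of one article (assumed input shape).
        public String write(List<List<String>> articles) {
            for (List<String> article : articles) {
                beginArticle();
                for (String line : article) {
                    out.append(line).append('\n');
                }
                endArticle();
            }
            return out.toString();
        }
    }

    // HTML-flavoured subclass: wraps every article in a <div> (an assumption).
    static class HtmlStripper extends Stripper {
        @Override
        protected void beginArticle() { out.append("<div>\n"); }

        @Override
        protected void endArticle() { out.append("</div>\n"); }
    }
}
```

Because the base class only knows about column boundaries, a subclass that
overrides the two hooks gets one wrapper element per column, not per paragraph,
which is why the "article" naming fits better.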

Sending        trunk/src/main/java/org/apache/pdfbox/ExtractText.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFText2HTML.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
Transmitting file data ...
Committed revision 747858.

> Improve html output
> -------------------
>
>                 Key: PDFBOX-434
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-434
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>         Attachments: html_improvements.diff
>
>
> Would like to improve the HTML output of PDF files for Arabic rendering. The attached
> file has changes that should improve the way the -html option works. Output files are now
> given the .html extension. We also added <DOCTYPE> information as well as a <meta>
> tag that declares the encoding of the file. Cleaned up a lot of unused code in PDFTextStripper
> and PDFText2HTML. Added the ability to set the <title> tag of the
> HTML document to the title given in the PDF document information, if it exists. Otherwise
> it will guess a title from the first lines of the file.
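
The header behaviour described in the issue (DOCTYPE, a <meta> tag carrying the
encoding, and a <title> taken from the document information or guessed from the
text) could look roughly like the sketch below. This is not the code from the
attached patch: the class HtmlHeaderSketch, its method names, the specific
DOCTYPE string, and the rule "use the first non-blank line as the title" are
all assumptions made for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical sketch; not the committed PDFText2HTML implementation.
class HtmlHeaderSketch {

    // Prefer the title from the document information; otherwise guess one
    // from the first non-blank line of the extracted text (assumed heuristic).
    // extractedText is assumed to be non-null.
    static String chooseTitle(String infoTitle, String extractedText) {
        if (infoTitle != null && !infoTitle.trim().isEmpty()) {
            return infoTitle.trim();
        }
        for (String line : extractedText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed;
            }
        }
        return "";
    }

    // Escape the characters that would otherwise break the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build the opening of the document, declaring the output encoding.
    static String header(String title, Charset encoding) {
        return "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
                + "\"http://www.w3.org/TR/html4/loose.dtd\">\n"
                + "<html><head>\n"
                + "<title>" + escape(title) + "</title>\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + encoding.name() + "\">\n"
                + "</head>\n<body>\n";
    }
}
```

Declaring the charset matters for the Arabic use case mentioned in the issue: a
browser that has to guess the encoding may render the extracted text incorrectly.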

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

