tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-889) XHTMLContentHandler won't emit newline when HTML element matches ENDLINE set
Date Thu, 09 Aug 2012 21:59:19 GMT

     [ https://issues.apache.org/jira/browse/TIKA-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-889.
------------------------------

       Resolution: Cannot Reproduce
    Fix Version/s: 1.3

Added unit test to validate in r137506
                
> XHTMLContentHandler won't emit newline when HTML element matches ENDLINE set
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-889
>                 URL: https://issues.apache.org/jira/browse/TIKA-889
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: John Conwell
>            Assignee: Ken Krugler
>             Fix For: 1.3
>
>
> XHTMLContentHandler.endElement checks whether the element is in the ENDLINE set to
> decide whether it should emit a newline. The HTML element names in ENDLINE are all
> lower case, but the HtmlParser class uses XHTMLDowngradeHandler to upper-case all
> HTML element names. This means that none of the HTML elements in the web page will
> match the entries in the ENDLINE set.
>
> The INDENT set has the same problem.
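
The mismatch the reporter describes can be sketched in isolation. The snippet below is not Tika's code: the class name, the helper methods, and the contents of the set are hypothetical stand-ins chosen only to show how a case-sensitive lookup of an upper-cased element name against a lower-case set would fail, and how normalizing the name first would avoid that. Note that the issue was resolved as Cannot Reproduce, so this illustrates the reported scenario rather than confirmed behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class EndlineCaseSketch {
    // Hypothetical stand-in for the ENDLINE set; the real entries are not quoted in the report.
    static final Set<String> ENDLINE =
            new HashSet<>(Arrays.asList("p", "div", "br", "li"));

    // Case-sensitive lookup: an upper-cased name never matches a lower-case entry.
    static boolean endsLineExact(String elementName) {
        return ENDLINE.contains(elementName);
    }

    // Assumed alternative: lower-case the name before the lookup.
    static boolean endsLineNormalized(String elementName) {
        return ENDLINE.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Simulates a handler that upper-cases element names before passing them on.
        String upperCased = "p".toUpperCase(Locale.ROOT);
        System.out.println(endsLineExact(upperCased));      // prints "false"
        System.out.println(endsLineNormalized(upperCased)); // prints "true"
    }
}
```

A unit test along these lines, fed with upper-case element names, is one way to check whether the handler in question is actually affected.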

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
