tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-889) XHTMLContentHandler wont emit newline when html element matches ENDLINE set
Date Thu, 09 Aug 2012 21:47:19 GMT

    [ https://issues.apache.org/jira/browse/TIKA-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432189#comment-13432189 ]

Ken Krugler commented on TIKA-889:
----------------------------------

Hi John - I tried this with trunk, and it works as expected.

Yes, it's true that XHTMLDowngradeHandler will uppercase the element names, but then DefaultHtmlMapper.mapSafeElement()
lower-cases them (I know, seems odd to me too). So the comparison works, and I see the expected
output.
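
To illustrate the round trip described above, here is a minimal, self-contained sketch. It does not use Tika's classes: the ENDLINE contents, the class name, and the downgrade/mapSafeElement/emitsNewline helpers are simplified, hypothetical stand-ins (the stand-in for mapSafeElement only lower-cases its input). It only shows why the lookup fails on the upper-cased name and succeeds once the name is lower-cased again.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ElementCaseSketch {
    // Hypothetical stand-in for the lower-case ENDLINE set described in the issue.
    static final Set<String> ENDLINE =
            new HashSet<>(Arrays.asList("p", "li", "ul", "div", "h1"));

    // Stand-in for the downgrade step, which upper-cases element names.
    static String downgrade(String name) {
        return name.toUpperCase(Locale.ENGLISH);
    }

    // Simplified stand-in for the mapping step: it only lower-cases the name.
    static String mapSafeElement(String name) {
        return name.toLowerCase(Locale.ENGLISH);
    }

    // The set lookup is case-sensitive, so only lower-case names match.
    static boolean emitsNewline(String name) {
        return ENDLINE.contains(name);
    }

    public static void main(String[] args) {
        String upper = downgrade("li");          // "LI"
        String mapped = mapSafeElement(upper);   // "li"
        System.out.println(upper + " matches ENDLINE: " + emitsNewline(upper));
        System.out.println(mapped + " matches ENDLINE: " + emitsNewline(mapped));
    }
}
```

In this sketch the upper-cased "LI" does not match the set, while the re-lowered "li" does, which is consistent with the comparison working on trunk.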

I'm adding a test case to validate behavior, at least for a simple <ul><li>xxx</li></ul>
example.
                
> XHTMLContentHandler wont emit newline when html element matches ENDLINE set
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-889
>                 URL: https://issues.apache.org/jira/browse/TIKA-889
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: John Conwell
>            Assignee: Ken Krugler
>
> XHTMLContentHandler.endElement checks whether the element is in the ENDLINE set to decide
> whether it should emit a newline. The HTML elements in ENDLINE are all lower case, but the
> HtmlParser class uses the XHTMLDowngradeHandler handler to upper-case all HTML elements. This
> means that none of the HTML elements in the web page will match the elements in the ENDLINE set.
>
> The INDENT set has the same problem.

