tika-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-379) Html elements and attributes not available in XHTML representation
Date Wed, 14 Apr 2010 11:34:52 GMT

    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856848#action_12856848 ]

Julien Nioche commented on TIKA-379:

Thanks for your comments.
I had seen the HTMLMapper but, as I pointed out, there is actually special treatment for
the elements in HEAD, done in the class HtmlHandler, so simply adding *link* to the
HTMLMapper does not solve the problem.
I will send a patch later today which modifies the HTMLMapper to make it generate LINK elements
in the XHTML output. This is a reasonable thing to do as this element is allowed in XHTML.
I will look at the HTMLMapper later to see how we could get it to keep the href attributes.
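
To picture why the mapper change alone is not enough, here is a small self-contained sketch.
It is not Tika's code: SketchMapper and SketchHandler are hypothetical stand-ins, and the only
assumption is that the handler has a HEAD-specific branch that never consults the mapper.
Under that assumption, whitelisting *link* in the mapper has no effect until the HEAD branch
itself lets LINK through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class HeadFilterSketch {

    /** Hypothetical stand-in for a mapper: decides which element names are kept. */
    static final class SketchMapper {
        private final Set<String> safe;

        SketchMapper(Set<String> safe) {
            this.safe = safe;
        }

        boolean isSafe(String name) {
            return safe.contains(name.toLowerCase(Locale.ROOT));
        }
    }

    /** Hypothetical stand-in for a handler with a special branch for HEAD (assumption). */
    static final class SketchHandler {
        private final SketchMapper mapper;
        private final boolean emitLinkInHead;
        private final List<String> emitted = new ArrayList<>();
        private boolean inHead;

        SketchHandler(SketchMapper mapper, boolean emitLinkInHead) {
            this.mapper = mapper;
            this.emitLinkInHead = emitLinkInHead;
        }

        void startElement(String name) {
            String n = name.toLowerCase(Locale.ROOT);
            if (n.equals("head")) {
                inHead = true;
                emitted.add(n);
                return;
            }
            if (inHead) {
                // HEAD-specific branch: the mapper is never asked here.
                if (n.equals("title") || (emitLinkInHead && n.equals("link"))) {
                    emitted.add(n);
                }
                return;
            }
            // Outside HEAD the mapper decides.
            if (mapper.isSafe(n)) {
                emitted.add(n);
            }
        }

        void endElement(String name) {
            if (name.equalsIgnoreCase("head")) {
                inHead = false;
            }
        }

        List<String> emitted() {
            return emitted;
        }
    }

    /** Feeds a fixed event sequence (head, link, body, p) and returns the elements kept. */
    static List<String> run(Set<String> safe, boolean emitLinkInHead) {
        SketchHandler h = new SketchHandler(new SketchMapper(safe), emitLinkInHead);
        h.startElement("head");
        h.startElement("link");
        h.endElement("link");
        h.endElement("head");
        h.startElement("body");
        h.startElement("p");
        h.endElement("p");
        h.endElement("body");
        return h.emitted();
    }

    public static void main(String[] args) {
        Set<String> safe = new HashSet<>(Arrays.asList("link", "body", "p"));
        System.out.println("mapper only:      " + run(safe, false));
        System.out.println("head branch too:  " + run(safe, true));
    }
}
```

In this sketch the first run drops LINK even though the mapper lists it; only the second run,
where the HEAD branch is changed as well, keeps it.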


> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.
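
A quick way to confirm the symptom is to parse the reported output and look at the root element.
The sketch below uses only the JDK XML parser, not Tika; the class name LangAttributeCheck and
the helper rootHasLang are made up for illustration, and the string is the output quoted above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class LangAttributeCheck {

    // XHTML output copied from the issue description.
    static final String REPORTED_XHTML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
            + "<body>document 1 titlejotain suomeksi</body></html>";

    /** Returns true if the root element carries a lang or xml:lang attribute. */
    static boolean rootHasLang(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            return root.hasAttribute("lang") || root.hasAttribute("xml:lang");
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lang present in reported output: " + rootHasLang(REPORTED_XHTML));
    }
}
```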

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

