tika-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (TIKA-379) Html elements and attributes not available in XHTML representation
Date Wed, 14 Apr 2010 08:40:50 GMT

    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856794#action_12856794 ]

Julien Nioche edited comment on TIKA-379 at 4/14/10 4:40 AM:
-------------------------------------------------------------

This is indeed a more generic problem. It also affects HTML elements like *link*, which are
commonly used in head sections to specify favicons or canonical representations. These values
are not stored in the metadata either, and they are vital for a crawler.

I agree with Ken that it would be better not only to store this information in the metadata but
also to be able to retrieve it from the SAX events.
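
As a rough illustration of what retrieving these values from the SAX events could look like on the consumer side, here is a minimal sketch. It uses only the JDK's SAX classes; the class name HeadLinkCollector and its helper method are hypothetical and not part of Tika. In practice the handler would be handed to the Tika parser rather than to the JDK parser, and it would only see these events if the parser let *link* and the lang attribute through.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): records what a crawler would want
// to read from the SAX events if the parser passed these elements through.
class HeadLinkCollector extends DefaultHandler {
    final List<String> links = new ArrayList<>(); // one "rel -> href" entry per <link>
    String lang;                                  // lang attribute of <html>, or null

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is used because the default JDK parser is not namespace-aware.
        if ("html".equals(qName)) {
            lang = atts.getValue("lang");
        } else if ("link".equals(qName)) {
            links.add(atts.getValue("rel") + " -> " + atts.getValue("href"));
        }
    }

    // Parses a well-formed XHTML string and returns the populated handler.
    static HeadLinkCollector collect(String xhtml) {
        HeadLinkCollector handler = new HeadLinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler;
    }
}
```

A crawler could then read handler.lang and handler.links after parsing instead of depending on the metadata alone.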

It looks like this is due to the filtering done in DefaultHTMLMapper, which can be overridden in
the Context, so we could simply pass a less restrictive filter. The default mapper is based
on [http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd], which allows *link* elements within
the *head*, so we could add it to _mapSafeElement()_. However, as there are no restrictions
on the hierarchy, this would mean that such elements would also be allowed within the *body*.
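
To make the hierarchy concern concrete, here is a minimal sketch of a position-aware check. It assumes, as described above, that the mapper decides on the element name alone, and it shows one possible alternative: tracking the open elements on a stack so that *link* is accepted only directly under *head*. The class HeadOnlyLinkFilter is hypothetical, uses only the JDK's SAX classes, and is not Tika's API or a proposed patch.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical filter (not part of Tika): accepts <link> only when its
// parent element is <head>, which a name-only whitelist cannot express.
class HeadOnlyLinkFilter extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>(); // stack of open elements
    final List<String> accepted = new ArrayList<>();       // hrefs of accepted links

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // open.peek() is the parent of the element being started (null at the root).
        if ("link".equals(qName) && "head".equals(open.peek())) {
            accepted.add(atts.getValue("href"));
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    // Parses a well-formed XHTML string and returns the hrefs that passed the check.
    static List<String> acceptedLinks(String xhtml) {
        HeadOnlyLinkFilter filter = new HeadOnlyLinkFilter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), filter);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return filter.accepted;
    }
}
```

The trade-off is that the filter has to keep state across events, whereas a name-only whitelist is stateless and simpler to override.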

Any thoughts?





      was (Author: jnioche):
    This is indeed a more generic problem. It also affects HTML elements like *link*, which
are commonly used in head sections to specify favicons or canonical representations. These
values are not stored in the metadata either, and they are vital for a crawler.

Is there a specific reason why these things are not rendered in the XHTML? I agree with Ken
that it would be better not only to store information in the metadata but also to be able
to retrieve them from the SAX events. 

Any thoughts on this?




  
> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.

