tika-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-379) Html elements and attributes not available in XHTML representation
Date Wed, 14 Apr 2010 08:18:48 GMT

    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856794#action_12856794 ]

Julien Nioche commented on TIKA-379:

This is indeed a more general problem. It also affects HTML elements such as *link*, which
are commonly used in head sections to specify favicons or canonical representations. These
values are not stored in the metadata either, and they are vital for a crawler.

Is there a specific reason why these things are not rendered in the XHTML? I agree with Ken
that it would be better not only to store the information in the metadata but also to be able
to retrieve it from the SAX events.
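
To illustrate what retrieving these values from the SAX events could look like, here is a
minimal sketch that uses only the JDK's SAX API, not Tika itself. It assumes the XHTML
stream actually carried the lang attribute and the link elements, which is exactly what
is missing today. The class name HeadInfoHandler and its collect method are hypothetical
names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records the html lang attribute and every link element seen in the SAX stream. */
class HeadInfoHandler extends DefaultHandler {
    String lang;                                   // value of <html lang="...">, or null if absent
    final List<String> links = new ArrayList<>();  // one "rel -> href" entry per <link> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("html".equals(localName)) {
            lang = atts.getValue("lang");
        } else if ("link".equals(localName)) {
            links.add(atts.getValue("rel") + " -> " + atts.getValue("href"));
        }
    }

    /** Parses an XHTML string with a namespace-aware JDK SAX parser and returns the filled handler. */
    static HeadInfoHandler collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);           // so that localName is populated
        HeadInfoHandler handler = new HeadInfoHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }
}
```

A crawler could pass a handler like this wherever it consumes the XHTML events. None of it
helps, of course, as long as the elements and attributes never reach the handler.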

Any thoughts on this?

> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

