Message-ID: <19596355.109461271243450103.JavaMail.jira@thor>
Date: Wed, 14 Apr 2010 07:10:50 -0400 (EDT)
From: "Jukka Zitting (JIRA)"
To: tika-dev@lucene.apache.org
Reply-To: tika-dev@lucene.apache.org
Mailing-List: contact tika-dev-help@lucene.apache.org; run by ezmlm
Subject: [jira] Commented: (TIKA-379) Html elements and attributes not available in XHTML representation
In-Reply-To: <65583687.295451266312867911.JavaMail.jira@brutus.apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856839#action_12856839 ]

Jukka Zitting commented on TIKA-379:
------------------------------------

The default HTML mapping rules in Tika exist to simplify and normalize input documents so that client applications can easily process all sorts of input (HTML or not) without needing type- or source-specific heuristics. The basic idea has been that clients should use the underlying parser libraries directly when they need custom processing of specific content types.

That said, I see the value of being able to process even complex HTML input through the Tika API, and perhaps the original intent described above is too strict for many use cases. The HtmlMapper interface we added for TIKA-347 should make it possible to relax the mapping rules, and in revision 933909 I added an IdentityHtmlMapper implementation of this interface to make it even easier to use:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

Note that IdentityHtmlMapper breaks the guarantee that the Tika output is valid XHTML. Also, the HtmlMapper interface currently covers only elements, so all attributes are still lost, and IdentityHtmlMapper overrides the custom tag handling in HtmlHandler, so even the href attributes are gone. It would be good if we could extend the HtmlMapper mechanism to avoid these problems.

> Html elements and attributes not available in XHTML representation
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>
> The following HTML document :
> document 1 titlejotain suomeksi
> is rendered as the following xhtml by Tika :
> </head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.

--
This message is automatically generated by JIRA.