Date: Thu, 9 Aug 2012 21:59:19 +0000 (UTC)
From: "Ken Krugler (JIRA)"
To: dev@tika.apache.org
Reply-To: dev@tika.apache.org
Message-ID: <369298452.3945.1344549559220.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <376429716.39340.1333133667815.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Resolved] (TIKA-889) XHTMLContentHandler wont emit newline when html element matches ENDLINE set

     [ https://issues.apache.org/jira/browse/TIKA-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-889.
------------------------------

       Resolution: Cannot Reproduce
    Fix Version/s: 1.3

Added unit test to validate in r137506

> XHTMLContentHandler wont emit newline when html element matches ENDLINE set
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-889
>                 URL: https://issues.apache.org/jira/browse/TIKA-889
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: John Conwell
>            Assignee: Ken Krugler
>             Fix For: 1.3
>
>
> XHTMLContentHandler.endElement checks whether the element is in the ENDLINE set to decide whether to emit a newline. The HTML element names in ENDLINE are all lower case, but the HtmlParser class uses the XHTMLDowngradeHandler handler to upper-case all HTML element names. This means that none of the HTML elements in the web page will match the entries in the ENDLINE set.
> The same problem affects the INDENT set.
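
For readers following the report, here is a minimal Java sketch of the mechanism it describes: a case-sensitive lookup in a set of lower-case element names misses an upper-cased name. It is not Tika's code. The class EndlineCaseSketch, both method names, and the four set entries are hypothetical, chosen only for illustration. The issue was resolved as Cannot Reproduce, so the sketch shows the reported reasoning, not confirmed behavior of XHTMLContentHandler.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the lookup described in the report; not Tika's code.
public class EndlineCaseSketch {

    // Illustrative lower-case entries (an assumption); the report does not
    // list the actual contents of ENDLINE.
    static final Set<String> ENDLINE = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("p", "div", "li", "br")));

    // Case-sensitive membership test, as the report describes it.
    static boolean emitsNewline(String elementName) {
        return ENDLINE.contains(elementName);
    }

    // One possible guard (an assumption, not the project's fix):
    // normalize the name before the lookup.
    static boolean emitsNewlineNormalized(String elementName) {
        return ENDLINE.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(emitsNewline("p"));           // true
        System.out.println(emitsNewline("P"));           // false: upper-cased name misses
        System.out.println(emitsNewlineNormalized("P")); // true
    }
}
```

Whether upper-cased names ever reach endElement in practice is exactly what the resolution disputes; the unit test mentioned above is the place to check the real behavior.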