Return-Path: Delivered-To: apmail-cocoon-dev-archive@www.apache.org Received: (qmail 54812 invoked from network); 3 Feb 2007 08:01:29 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 3 Feb 2007 08:01:29 -0000 Received: (qmail 3244 invoked by uid 500); 3 Feb 2007 08:01:35 -0000 Delivered-To: apmail-cocoon-dev-archive@cocoon.apache.org Received: (qmail 2803 invoked by uid 500); 3 Feb 2007 08:01:33 -0000 Mailing-List: contact dev-help@cocoon.apache.org; run by ezmlm Precedence: bulk list-help: list-unsubscribe: List-Post: Reply-To: dev@cocoon.apache.org List-Id: Delivered-To: mailing list dev@cocoon.apache.org Received: (qmail 2786 invoked by uid 99); 3 Feb 2007 08:01:33 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 03 Feb 2007 00:01:33 -0800 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 03 Feb 2007 00:01:25 -0800 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id C15847141F2 for ; Sat, 3 Feb 2007 00:01:05 -0800 (PST) Message-ID: <19257162.1170489665789.JavaMail.jira@brutus> Date: Sat, 3 Feb 2007 00:01:05 -0800 (PST) From: "Abbas Mousavi (JIRA)" To: dev@cocoon.apache.org Subject: [jira] Commented: (COCOON-2002) HTML transformer only works with latin-1 characters In-Reply-To: <32471879.1170489545503.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/COCOON-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469945 ] Abbas Mousavi commented on COCOON-2002: --------------------------------------- this change in org.apache.cocoon.transformation.HTMLTransformer solved the problem, the change is near line 173 >>> new ByteArrayInputStream(text.getBytes("UTF-8")); --------------------------------------------------------------------------------------- /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ package org.apache.cocoon.transformation; import java.io.BufferedInputStream; import java.io.ByteArrayInputStream; import java.io.IOException; import java.io.PrintWriter; import java.io.StringWriter; import java.util.HashMap; import java.util.Map; import java.util.Properties; import java.util.StringTokenizer; import org.apache.avalon.framework.configuration.Configurable; import org.apache.avalon.framework.configuration.Configuration; import org.apache.avalon.framework.configuration.ConfigurationException; import org.apache.avalon.framework.parameters.Parameters; import org.apache.cocoon.ProcessingException; import org.apache.cocoon.environment.SourceResolver; import org.apache.cocoon.transformation.AbstractSAXTransformer; import org.apache.cocoon.xml.XMLUtils; import org.apache.cocoon.xml.IncludeXMLConsumer; import org.apache.excalibur.source.Source; import org.w3c.tidy.Tidy; import org.xml.sax.Attributes; import org.xml.sax.SAXException; /** * Converts (escaped) HTML snippets into JTidied HTML. * This transformer expects a list of elements, passed as comma separated * values of the "tags" parameter. It records the text enclosed in such * elements and pass it thru JTidy to obtain valid XHTML. * *

TODO: Add namespace support. *

WARNING: This transformer should be considered unstable. * * @author Daniele Madama * @author Gianugo Rabellino * * @version CVS $Id: HTMLTransformer.java 433543 2006-08-22 06:22:54Z crossley $ */ public class HTMLTransformer extends AbstractSAXTransformer implements Configurable { /** * Properties for Tidy format */ private Properties properties; /** * Tags that must be normalized */ private Map tags; /** * React on endElement calls that contain a tag to be * tidied and run Jtidy on it, otherwise passthru. * * @see org.xml.sax.ContentHandler#endElement(java.lang.String, java.lang.String, java.lang.String) */ public void endElement(String uri, String name, String raw) throws SAXException { if (this.tags.containsKey(name)) { String toBeNormalized = this.endTextRecording(); try { this.normalize(toBeNormalized); } catch (ProcessingException e) { e.printStackTrace(); } } super.endElement(uri, name, raw); } /** * Start buffering text if inside a tag to be normalized, * passthru otherwise. * * @see org.xml.sax.ContentHandler#startElement(java.lang.String, java.lang.String, java.lang.String, org.xml.sax.Attributes) */ public void startElement( String uri, String name, String raw, Attributes attr) throws SAXException { super.startElement(uri, name, raw, attr); if (this.tags.containsKey(name)) { this.startTextRecording(); } } /** * Configure this transformer, possibly passing to it * a jtidy configuration file location. */ public void configure(Configuration config) throws ConfigurationException { super.configure(config); String configUrl = config.getChild("jtidy-config").getValue(null); if (configUrl != null) { org.apache.excalibur.source.SourceResolver resolver = null; Source configSource = null; try { resolver = (org.apache.excalibur.source.SourceResolver) this.manager.lookup(org.apache.excalibur.source.SourceResolver.ROLE); configSource = resolver.resolveURI(configUrl); if (getLogger().isDebugEnabled()) { getLogger().debug( "Loading configuration from " + configSource.getURI()); } this.properties = new Properties(); this.properties.load(configSource.getInputStream()); } catch (Exception e) { getLogger().warn("Cannot load configuration from " + configUrl); throw new ConfigurationException( "Cannot load configuration from " + configUrl, e); } finally { if (null != resolver) { this.manager.release(resolver); resolver.release(configSource); } } } } /** * The beef: run JTidy on the buffered text and stream * the result * * @param text the string to be tidied */ private void normalize(String text) throws ProcessingException { try { // Setup an instance of Tidy. Tidy tidy = new Tidy(); tidy.setXmlOut(true); if (this.properties == null) { tidy.setXHTML(true); } else { tidy.setConfigurationFromProps(this.properties); } //Set Jtidy warnings on-off tidy.setShowWarnings(getLogger().isWarnEnabled()); //Set Jtidy final result summary on-off tidy.setQuiet(!getLogger().isInfoEnabled()); //Set Jtidy infos to a String (will be logged) instead of System.out StringWriter stringWriter = new StringWriter(); PrintWriter errorWriter = new PrintWriter(stringWriter); tidy.setErrout(errorWriter); // Extract the document using JTidy and stream it. ByteArrayInputStream bais = new ByteArrayInputStream(text.getBytes("UTF-8")); org.w3c.dom.Document doc = tidy.parseDOM(new BufferedInputStream(bais), null); // FIXME: Jtidy doesn't warn or strip duplicate attributes in same // tag; stripping. XMLUtils.stripDuplicateAttributes(doc, null); errorWriter.flush(); errorWriter.close(); if (getLogger().isWarnEnabled()) { getLogger().warn(stringWriter.toString()); } IncludeXMLConsumer.includeNode(doc, this.contentHandler, this.lexicalHandler); } catch (Exception e) { throw new ProcessingException( "Exception in HTMLTransformer.normalize()", e); } } /** * Setup this component, passing the tag names to be tidied. */ public void setup( SourceResolver resolver, Map objectModel, String src, Parameters par) throws ProcessingException, SAXException, IOException { super.setup(resolver, objectModel, src, par); String tagsParam = par.getParameter("tags", ""); if (getLogger().isDebugEnabled()) { getLogger().debug("tags: " + tagsParam); } this.tags = new HashMap(); StringTokenizer tokenizer = new StringTokenizer(tagsParam, ","); while (tokenizer.hasMoreElements()) { String tok = tokenizer.nextToken().trim(); this.tags.put(tok, tok); } } } > HTML transformer only works with latin-1 characters > ---------------------------------------------------- > > Key: COCOON-2002 > URL: https://issues.apache.org/jira/browse/COCOON-2002 > Project: Cocoon > Issue Type: Bug > Components: Blocks: HTML > Affects Versions: 2.1.10, 2.1.11-dev (Current SVN) > Reporter: Abbas Mousavi > Priority: Critical > > when transforming HTML in encodings other than latin-1 > the result is a page of question mark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.