xml-general mailing list archives

From Bentley Drake <bent...@nightfire.com>
Subject RE: JAXP/Crimson whitespace issue question
Date Sun, 02 Dec 2001 21:43:54 GMT
Edwin,

How about the following approach:

In class org.apache.crimson.tree.ParentNode, the following method can be
found:

    public void writeChildrenXml (XmlWriteContext context) throws IOException

This method writes out child nodes in 'pretty-print' fashion to the stream
Writer associated with the context that is passed in.

I added code to check for TEXT nodes that are all whitespace when not
preserving whitespace.  If a node meets these conditions, a single space
character is written in place of the node's value.  (See the code
between the BEGIN ADDITIONS/END ADDITIONS text below.)  This prevents the
whitespace from growing by an indentation amount each time the same XML is
parsed and then streamed out again.  It also collapses whitespace-only TEXT
nodes that are not indentation-related, unless the xml:space='preserve'
attribute is set.


    public void writeChildrenXml (XmlWriteContext context) throws IOException
    {
	if (children == null)
	    return;

	int	oldIndent = 0;
	boolean	preserve = true;
	boolean	pureText = true;

	if (getNodeType () == ELEMENT_NODE) {
	    preserve = "preserve".equals (
		    getInheritedAttribute ("xml:space"));
	    oldIndent = context.getIndentLevel ();
	}

	try {
	    if (!preserve)
		context.setIndentLevel (oldIndent + 2);
	    for (int i = 0; i < length; i++) {
		if (!preserve && children [i].getNodeType () != TEXT_NODE) {
		    context.printIndent ();
		    pureText = false;
		}
        
        // BEGIN ADDITIONS.
        // If we're not preserving whitespace, and the current node
        // is a TEXT node containing whitespace and nothing else,
        // write a single space in its place.
        // (Earlier attempt, using the isIndent() check instead:)
        // if ( !preserve && (children [i].getNodeType() == TEXT_NODE) &&
        //      context.isIndent(children [i].getNodeValue()) )
        if ( !preserve && (children [i].getNodeType() == TEXT_NODE) &&
             isWhitespace(children [i].getNodeValue()) )
        {
            // Normalize whitespace to one space character?
            Writer	out = context.getWriter ();
            out.write( ' ' );
            continue;
        }
        // END ADDITIONS.

        children [i].writeXml (context);
	    }
	} finally {
	    if (!preserve) {
		context.setIndentLevel (oldIndent);
		if (!pureText)
		    context.printIndent ();		// for ETag
	    }
	}
    }


The isWhitespace() method just checks the String value to see if it consists
entirely of whitespace:


    // BEGIN ADDITIONS.
    private boolean isWhitespace ( String value )
    {
        if ( (value == null) || (value.length() == 0) )
            return false;

        int len = value.length( );

        for (int i = 0; i < len; i++)
        {
            // Character.isSpaceChar() doesn't work here.
            // Should the following check - taken from method
            // removeWhiteSpaces() - be used instead?
            // if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            if ( !Character.isWhitespace( value.charAt(i) ) )
                return false;
        }
        
        return true;
    }
    // END ADDITIONS.
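
As a side note on the comment about Character.isSpaceChar(): the following
is a minimal standalone sketch, not part of the Crimson patch, that shows
why the two JDK methods differ and what the stricter four-character check
would look like.  The class name WhitespaceCheck and the method names
isAllWhitespace() and isXmlWhitespace() are hypothetical, introduced only
for illustration.

```java
// Hypothetical helper class for illustration; not part of Crimson.
final class WhitespaceCheck {

    // Same logic as the isWhitespace() method above: true only for a
    // non-empty string made up entirely of Java whitespace characters.
    static boolean isAllWhitespace(String value) {
        if (value == null || value.length() == 0)
            return false;
        for (int i = 0; i < value.length(); i++) {
            if (!Character.isWhitespace(value.charAt(i)))
                return false;
        }
        return true;
    }

    // Stricter variant using the four characters from the commented-out
    // check (space, tab, LF, CR), which are the characters XML 1.0
    // treats as white space.
    static boolean isXmlWhitespace(String value) {
        if (value == null || value.length() == 0)
            return false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c != ' ' && c != '\t' && c != '\n' && c != '\r')
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // isSpaceChar() only covers Unicode space/line/paragraph
        // separators, so it rejects the control characters tab, LF and CR.
        System.out.println(Character.isSpaceChar('\n'));      // false
        System.out.println(Character.isWhitespace('\n'));     // true
        // The reverse holds for the non-breaking space.
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
        System.out.println(Character.isWhitespace('\u00A0')); // false
        // Form feed is Java whitespace but not XML white space.
        System.out.println(isAllWhitespace("\f"));            // true
        System.out.println(isXmlWhitespace("\f"));            // false
    }
}
```

Which variant is appropriate depends on whether characters such as form
feed should count as ignorable; the four-character check is the narrower
of the two.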



I tried making the code more discriminating by adding an isIndent() method
to the XmlWriteContext class that checks for indentation whitespace in the
same way that printIndent() generates it, but the algorithm failed due to
line separator problems (I'm testing on Win-NT).  See the code below.


    // BEGIN ADDITION
    public boolean isIndent ( String value ) throws IOException
    {
        if ( value == null )
            return false;

        if (!prettyOutput)
            return false;

        // NOTE: The following code won't work unless the line separator
        // is obtained properly!!!  (See "line.separator" system property.)

        // Check that value is long enough to be indentation.
        if ( value.length() < (XmlDocument.eol.length() + indentLevel) )
            return false;

        // Check that value starts with EOL.
        if( !value.startsWith( XmlDocument.eol ) )
            return false;

        // Advance past EOL.
        value = value.substring( XmlDocument.eol.length() );
        
        for ( int i = 0; i < indentLevel; i++ )
        {
            if ( value.charAt(i) != ' ' )
                return false;
        }
        
        return true;
    }
    // END ADDITION
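
One plausible cause of the line separator problem (an assumption on my
part, not something verified against Crimson) is that an XML parser
normalizes line ends to a single "\n", while the platform separator on
Windows is "\r\n", so a startsWith() test against the platform value never
matches parsed text.  A minimal sketch of a check that accepts any of the
three common line-end forms is shown below.  The class IndentCheck and the
explicit indentLevel parameter are hypothetical, and unlike the method
above this version requires exactly indentLevel spaces after the line end.

```java
// Hypothetical standalone variant for illustration; not part of Crimson.
final class IndentCheck {

    // True if value is one line end ("\r\n", "\n" or "\r") followed by
    // exactly indentLevel space characters and nothing else.
    static boolean isIndent(String value, int indentLevel) {
        if (value == null || indentLevel < 0)
            return false;

        // Work out how many characters the leading line end occupies.
        int pos;
        if (value.startsWith("\r\n"))
            pos = 2;
        else if (value.startsWith("\n") || value.startsWith("\r"))
            pos = 1;
        else
            return false;

        // The remainder must be exactly the indentation.
        if (value.length() - pos != indentLevel)
            return false;

        for (int i = pos; i < value.length(); i++) {
            if (value.charAt(i) != ' ')
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIndent("\n    ", 4));    // true
        System.out.println(isIndent("\r\n    ", 4));  // true
        System.out.println(isIndent("    ", 4));      // false
    }
}
```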


I tested the code using my unit-tests and it works for my application, but I
don't know if it breaks DOM conformance, as I don't have an exhaustive
conformance test suite. I've attached the modified source files (ParentNode,
XmlWriteContext).

Thanks,

Bentley



-----Original Message-----
From: Edwin Goei [mailto:edwingo@sun.com]
Sent: Saturday, December 01, 2001 11:53 AM
To: Bentley Drake
Cc: 'general@xml.apache.org'
Subject: Re: JAXP/Crimson whitespace issue question


Bentley Drake wrote:
> 
> Hello,
> 
> I'm noticing that the Crimson library seems to insert an awful lot of
> whitespace into XML text streams that it generates from DOM.  This
> whitespace doesn't appear if a DOCTYPE reference is present, since the
> code can use a validating parser and can therefore call the
> DocumentBuilderFactory.setIgnoringElementContentWhitespace() method.  I've
> attached a Java class that parses and regenerates the XML text stream
> (string->DOM->string), along with an example XML.  Here's an example of
> the output when the program is run against an XML document in a cyclic
> fashion:

Not sure I have time to work on this now.  Maybe you can provide a
patch.  It looks like the code is using internal, non-JAXP APIs.  If you
have Xalan or the latest version of the JAXP RI 1.1.3, then you can use
the TrAX API to do what you want.  See my unofficial JAXP FAQ at
http://xml.apache.org/~edwing/jaxp-faq.html for more info.
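
A minimal sketch of the TrAX route, using only the standard JAXP packages
(javax.xml.parsers and javax.xml.transform): an identity Transformer
copies the DOM to a Writer without adding indentation.  The class name
TraxRoundTrip and the method roundTrip() are hypothetical, and the exact
output text depends on the JAXP implementation in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration: string -> DOM -> string via TrAX.
final class TraxRoundTrip {

    static String roundTrip(String xml) {
        try {
            // Parse the text into a DOM tree with the default JAXP parser.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // newTransformer() with no stylesheet is an identity transform.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so callers need not declare checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a>\n  <b>text</b>\n</a>";
        String once = roundTrip(xml);
        String twice = roundTrip(once);
        // With indentation off, repeated cycles should not add whitespace.
        System.out.println(once.equals(twice));
    }
}
```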

-Edwin

