xml-general mailing list archives

From Elliotte Rusty Harold <elh...@metalab.unc.edu>
Subject Re: Issue with Crimson Parser
Date Sat, 06 Dec 2003 01:19:08 GMT
Hi,

I have a very serious issue, that might affect my project as a whole.

My project converts an input XML file to an output XML file in a
predefined format. This is done in Java (JDK 1.3.1) with JAXP 1.1, using
the Crimson SAX parser to parse the input file.

My files range from 4 to 40 MB. When I parse a file larger than 1 MB,
the parser fails to read some of the characters at certain fixed places.
It always happens at the same places. I am sure the input file contains
that data in the correct format. It happens only with the data, not with
the tags: start element and end element work fine, and only characters
is affected.


1. You need to upgrade to Xerces (but this will not fix your problem).
2. See http://www.cafeconleche.org/books/xmljava/chapters/ch06s07.html
   (this will likely fix your problem).

In brief, when there's a large amount of text between two tags with 
no intervening markup, the parser may choose to call characters() 
multiple times even though it doesn't need to. Xerces generally won't 
pass more than 16K of text in one call. Crimson is limited to about 
8K of text per call. At the extreme, I have even seen a parser pass a 
single character at a time to the characters() method. You must not 
assume that the parser will pass you the maximum contiguous run of 
text in a single call to characters().
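
As a rough illustration (a sketch under assumptions, not your actual
code): if a handler stores only what the most recent characters() call
delivered, it keeps just the last chunk of a long text run, which would
look like missing data at fixed places. The usual remedy is to append
every chunk to a buffer and read the buffer in endElement(). The class
name TextAccumulator, the helper itemTexts, and the element name "item"
are all hypothetical, and the sketch uses StringBuilder and generics, so
on JDK 1.3.1 you would substitute StringBuffer and a plain List.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <item> element,
// however many characters() calls the parser splits that text into.
class TextAccumulator extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> values = new ArrayList<>();
    private boolean inItem = false;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("item".equals(qName)) {
            inItem = true;
            buffer.setLength(0); // start a fresh run of text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append rather than assign: this may be one chunk of many.
        if (inItem) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(qName)) {
            values.add(buffer.toString()); // the text is complete only here
            inItem = false;
        }
    }

    // Convenience wrapper; checked exceptions are wrapped to keep the
    // sketch short.
    static List<String> itemTexts(String xml) {
        try {
            TextAccumulator handler = new TextAccumulator();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            big.append('x');
        }
        List<String> texts = itemTexts("<root><item>" + big + "</item></root>");
        System.out.println(texts.get(0).length());
    }
}
```

The point is that the text is read only in endElement(), so the result
does not depend on where a particular parser chooses to split the run.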

-- 

   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Effective XML (Addison-Wesley, 2003)
   http://www.cafeconleche.org/books/effectivexml            
   http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA 
