uima-user mailing list archives

From John Wiesel <john.wie...@fu-berlin.de>
Subject Re: XmiCasDeserializer.deserialize with InputSource rather than InputStream
Date Mon, 23 Aug 2010 09:28:15 GMT

Thanks, all, for your input. I am now fairly sure that my problem stems
from my improper handling of array sizes, which produced malformed XML
that choked the parser (SAXParseException).

> and then use a ByteArrayInputStream.

> Or, if you feel you have to store the file in memory before parsing,
> then at least store it in a ByteArray and not a CharArray.  Then you
> can feed the parser with an InputStream on the ByteArray, and avoid
> the encoding and byte-order problems that Marshall describes.

That was my initial implementation, but I had since moved away from it,
following a line of reasoning I read online that no longer makes much
sense to me.
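For reference, a minimal sketch of the byte-array approach suggested above. It uses only the JDK's SAX parser; the class and method names are made up for illustration. The document is held as bytes, and the parser reads the encoding from the XML declaration itself. With UIMA, the same kind of stream should be usable with XmiCasDeserializer.deserialize(InputStream, CAS), though that call is not exercised here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of UIMA.
final class ByteArrayXmlDemo {

    /** Parses XML held in a byte array and returns its concatenated text content. */
    static String textOf(byte[] xmlBytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node, so append each chunk.
                text.append(ch, start, length);
            }
        };
        // No charset is passed here: the parser takes it from the XML declaration.
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>Gr\u00fc\u00dfe</doc>";
        // Keep the raw bytes exactly as they would be stored on disk.
        byte[] cached = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(cached).equals("Gr\u00fc\u00dfe")); // prints true
    }
}
```

Keeping the data as bytes means the decoding step happens exactly once, inside the parser, instead of once when reading into a String and again when parsing.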

> It is a very bad idea to read an XML file and store it in a String.

Thanks for the advice. My default option is to read the XML files via a
FileInputStream as recommended. But I intended to store the XML in
memory while I develop before I deploy as a primitive way of caching.
This seemed useful, as my data (ca. 15k rather small articles, each in
its own CAS serialized to an individual XMI file) will soon be located
on a network share. Increasingly large, differing subsets of them will
be accessed over time.

I will rethink my approach and keep your advice in mind.
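One way the primitive caching described above could be kept while following that advice is to cache raw bytes per file rather than Strings or char arrays. This is only a sketch under that assumption: XmiByteCache is a hypothetical helper, it never evicts entries, and it assumes all files fit in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: an unbounded in-memory cache of file contents as bytes.
final class XmiByteCache {
    private final Map<Path, byte[]> cache = new ConcurrentHashMap<>();

    /** Returns a fresh stream over the file's bytes, reading the file only on first access. */
    InputStream open(Path file) {
        byte[] bytes = cache.computeIfAbsent(file, p -> {
            try {
                return Files.readAllBytes(p);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return new ByteArrayInputStream(bytes);
    }

    /** Number of files currently held in memory. */
    int size() {
        return cache.size();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("article", ".xmi");
        Files.write(tmp, "<doc/>".getBytes(StandardCharsets.UTF_8));
        XmiByteCache cache = new XmiByteCache();
        cache.open(tmp).close();   // first access reads from disk
        Files.delete(tmp);
        cache.open(tmp).close();   // second access is served from memory
        System.out.println(cache.size());
    }
}
```

Each call to open returns a new stream over the same cached array, so the stream can be handed directly to any parser that accepts an InputStream.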

Thanks again,

On 22.08.2010 22:48, Marshall Schor wrote:
> I'm not an expert here, but I found by googling that at least one 
> person thinks it's a bad practice to read things into char arrays, 
> and then send those to an XML parser.
> The web page http://www.odi.ch/prog/design/newbies.php#7 says:
> It is a very bad idea to read an XML file and store it in a String. 
> An XML specifies its encoding in the XML header. But when reading a 
> file you have to know the encoding beforehand! Also storing an XML 
> file in a String wastes memory. All XML parsers accept an InputStream
> as a parsing source and they figure out the encoding themselves
> correctly. So you can feed them an InputStream instead of storing the
> whole file in memory temporarily. The byte order (big-endian,
> little-endian) is another trap when a multi-byte encoding (such as
> UTF-8) is used. XML files may carry a byte order mark at the
> beginning that specifies the byte order. XML parsers handle them
> correctly.
> -Marshall
> On 8/22/2010 8:52 AM, John Wiesel wrote:
>> Dear all,
>> I am currently stalled in my project by 
>> XmiCasDeserializer.deserialize: I am wondering why there is no 
>> method that allows setting up the XML parser directly with an 
>> InputSource instead of an InputStream. I would like to load my CAS
>> from an XMI file that I have cached in a CharArray. As I cannot 
>> generate an InputStream from a String (StringBufferInputStream is 
>> deprecated since JDK 1.1) but should be able to do so using an 
>> InputSource w/o much trouble, I hope there is a sensible solution 
>> for this that I just haven't thought of yet.
>> Any suggestions? Thanks folks.
>> John