uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: CAS serializationWithCompression
Date Tue, 12 Jan 2016 19:55:36 GMT
hmmm, seems like unusual behavior.

It would help a lot in diagnosing this if you could construct a small test
case: one which creates a CAS, fills it with a bit of data, does the compressed
serialization, resets the CAS, and loops, to see whether that produces
"expanding" serializations.

  -- if it does, please post the test case to a Jira and we'll diagnose / fix
this :-)

  -- if it doesn't, then you have to get closer to your actual use case and
iterate until you see what it is that you last added that starts making it
serialize ever-expanding instances.  That will be a big clue, I think.
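
As a rough starting point for such a test case, here is a minimal sketch of
the measuring loop only. It uses plain Java and no UIMA classes; the names
SerializationGrowthCheck, Cycle, measure and isEverExpanding are hypothetical.
The real create / fill / serializeWithCompression / reset steps would go where
the stand-in cycle is, so treat this as an outline under those assumptions
rather than a finished test.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical harness: records how many bytes each loop iteration serializes. */
class SerializationGrowthCheck {

    /** One fill / serialize / reset cycle; it must write the serialized bytes to out. */
    interface Cycle {
        void run(int iteration, ByteArrayOutputStream out) throws IOException;
    }

    /** Runs the cycle the given number of times and returns the byte count of each run. */
    static List<Integer> measure(int iterations, Cycle cycle) throws IOException {
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            // A fresh buffer per iteration, so each size reflects one serialization only.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            cycle.run(i, out);
            sizes.add(out.size());
        }
        return sizes;
    }

    /** True only if every size is strictly larger than the one before it. */
    static boolean isEverExpanding(List<Integer> sizes) {
        if (sizes.size() < 2) {
            return false;
        }
        for (int i = 1; i < sizes.size(); i++) {
            if (sizes.get(i) <= sizes.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in cycle (assumption): writes the same bytes every time.
        // In the real test this is where the CAS would be filled, passed to
        // serializeWithCompression together with 'out', and then reset.
        byte[] payload = "same data every time".getBytes(StandardCharsets.UTF_8);
        List<Integer> sizes = measure(5, (i, out) -> out.write(payload));
        System.out.println(sizes + " expanding=" + isEverExpanding(sizes));
    }
}
```

If the same input data yields strictly growing sizes in a loop like this,
that loop is the test case to attach to the Jira.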


On 1/12/2016 10:54 AM, D. Heinze wrote:
> The CAS.size() starts as larger than the serializedWithCompression version,
> but eventually the serializedWithCompression version grows to be larger than
> the CAS.size().
> The overall process is:
> * Create a new CAS
> * Read in an xml document and store the structure and content in the cas.
> * Tokenize and parse the document and store that info in the cas.
> * Run a number of lexical engines and ConceptMapper engines on the data and
> store that info in the cas
> * Produce an xml document with the content of the original input document
> marked up with the analysis results and both write that out to a file and
> also store it in the cas
> * serializeWithCompression to a FileOutputStream
> * cas.reset()
> * iterate on the next input document
> All the work other than creating and cas.reset() is done using the JCas.
> Even though the output CASes keep getting larger, they seem to deserialize
> just fine and are usable.
> Thanks/Dan
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:rec@apache.org] 
> Sent: Tuesday, January 12, 2016 2:45 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
> Is the CAS.size() larger than the serialized version or smaller?
> What are you actually doing to the CAS? Just serializing/deserializing a
> couple of times in a row, or do you actually add feature structures?
> The sample code you show doesn't give any hint about where the CAS comes
> from and what is being done with it.
> -- Richard
>> On 12.01.2016, at 03:06, D. Heinze <dheinze@gnoetics.com> wrote:
>> I'm having a problem with CAS serializationWithCompression.  I am 
>> processing a few million text documents on an IBM P8 with 16 physical 
>> SMT8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>> I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.
>> I use serializeWithCompression to save the final state of the 
>> processing on each document to a file for later processing.
>> However, the size of the serialized CAS just keeps growing.  The size 
>> of the CAS is stable, but the serialized CASes just keep getting 
>> bigger. I even went to creating a new CAS for each process instead of 
>> using cas.reset().  I have also tried writing the serialized CAS to a 
>> byte array output stream first and then to a file, but it is the 
>> serializeWithCompression that causes the size problem, not writing the
>> file.
>> Here's what the code looks like.  Flushing or not flushing does not 
>> make a difference.  Closing or not closing the file output stream does 
>> not make a difference (other than leaking memory).  I've also tried 
>> doing serializeWithCompression with type filtering.  Wanted to try 
>> using a Marker, but cannot see how to do that.  The problem exists 
>> regardless of whether I run 1 or 55 pipelines concurrently.
>>        File fout = new File(documentPath);
>>        fos = new FileOutputStream(fout);
>>        org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
>>        fos.flush();
>>        fos.close();
>>        logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);
>> Suggestions will be appreciated.
>> Thanks / Dan
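
Note that the snippet quoted above logs cas.size(), not the number of bytes
actually written. One way to log the serialized size itself is to wrap the
output stream in a byte counter. The sketch below is plain Java; the class
name CountingOutputStream and its getCount method are hypothetical, and it
assumes the serializer writes through the stream it is handed.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical wrapper that counts the bytes passing through to the wrapped stream. */
class CountingOutputStream extends FilterOutputStream {

    private long count;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Write the block in one call instead of byte by byte, then count it once.
        out.write(b, off, len);
        count += len;
    }

    /** Number of bytes written so far. */
    long getCount() {
        return count;
    }
}
```

Wrapping the FileOutputStream in this class, passing the wrapper to the
serializer, and logging getCount() after close() would put the serialized
size next to cas.size() in the same log line.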
