uima-user mailing list archives

From "D. Heinze" <dhei...@gnoetics.com>
Subject RE: CAS serializationWithCompression
Date Tue, 12 Jan 2016 15:54:06 GMT
Initially, CAS.size() is larger than the serializeWithCompression output,
but eventually the serializeWithCompression output grows to be larger than
CAS.size().
The overall process is:
* Create a new CAS.
* Read in an XML document and store its structure and content in the CAS.
* Tokenize and parse the document and store that information in the CAS.
* Run a number of lexical engines and ConceptMapper engines on the data and
store that information in the CAS.
* Produce an XML document with the content of the original input document
marked up with the analysis results, write it out to a file, and
also store it in the CAS.
* Call serializeWithCompression, writing to a FileOutputStream.
* Call cas.reset().
* Iterate on the next input document.
All work other than creating the CAS and calling cas.reset() is done using the JCas.
Even though the output CASes keep getting larger, they seem to deserialize
just fine and are usable.
Thanks/Dan
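
The logging line in the quoted code reports cas.size() rather than the number of bytes actually written, so one way to compare the two figures per document is to count bytes as they pass through the output stream. The following is a minimal sketch of that idea using only the JDK. The names SerializedSizeProbe, CountingOutputStream, Serializer, measure and fakeSerialize are hypothetical, and fakeSerialize uses java.util.zip.Deflater purely as a stand-in for the real serializer; it does not reproduce UIMA's compressed format.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class SerializedSizeProbe {

    /** Passes bytes through to the wrapped stream and counts them. */
    static final class CountingOutputStream extends FilterOutputStream {
        private long count;

        CountingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            // Delegate the whole range at once instead of byte by byte.
            out.write(b, off, len);
            count += len;
        }

        long getCount() {
            return count;
        }
    }

    /** Hypothetical hook: anything that writes a serialized form to a stream. */
    interface Serializer {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Runs the serializer against a counting wrapper and returns the bytes written. */
    static long measure(Serializer serializer, OutputStream sink) throws IOException {
        CountingOutputStream counting = new CountingOutputStream(sink);
        serializer.writeTo(counting);
        counting.flush();
        return counting.getCount();
    }

    /** Stand-in serializer (assumption): deflates the payload; NOT the UIMA format. */
    static void fakeSerialize(byte[] payload, OutputStream out) throws IOException {
        Deflater deflater = new Deflater();
        try {
            DeflaterOutputStream z = new DeflaterOutputStream(out, deflater);
            z.write(payload);
            z.finish(); // emit remaining compressed data without closing 'out'
        } finally {
            deflater.end(); // release the deflater's native resources
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "stand-in document content".getBytes(StandardCharsets.UTF_8);
        for (int doc = 1; doc <= 3; doc++) {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            long written = measure(out -> fakeSerialize(payload, out), sink);
            // Identical input should report an identical size on every iteration.
            System.out.println("doc " + doc + " serialized bytes: " + written);
        }
    }
}
```

In the pipeline described above, the counting wrapper would presumably be placed around the FileOutputStream and passed to serializeWithCompression in its place, so that the logged value is the byte count of the file rather than cas.size(). Logging both numbers per document would show at which iteration the serialized size starts to overtake the in-memory size.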

-----Original Message-----
From: Richard Eckart de Castilho [mailto:rec@apache.org] 
Sent: Tuesday, January 12, 2016 2:45 AM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

Is the CAS.size() larger than the serialized version or smaller?
What are you actually doing to the CAS? Just serializing/deserializing a
couple of times in a row, or do you actually add feature structures?
The sample code you show doesn't give any hint about where the CAS comes
from and what is being done with it.

-- Richard

> On 12.01.2016, at 03:06, D. Heinze <dheinze@gnoetics.com> wrote:
> 
> I'm having a problem with CAS serializationWithCompression.  I am 
> processing a few million text documents on an IBM P8 with 16 physical 
> SMT8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
> 
> I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.
> 
> I use serializeWithCompression to save the final state of the 
> processing on each document to a file for later processing.
> 
> However, the size of the serialized CAS just keeps growing.  The size 
> of the CAS is stable, but the serialized CASes just keep getting 
> bigger. I even went to creating a new CAS for each process instead of 
> using cas.reset().  I have also tried writing the serialized CAS to a 
> byte array output stream first and then to a file, but it is the 
> serializeWithCompression that caused the size problem, not writing the
> file.
> 
> Here's what the code looks like.  Flushing or not flushing does not 
> make a difference.  Closing or not closing the file output stream does 
> not make a difference (other than leaking memory).  I've also tried 
> doing serializeWithCompression with type filtering.  Wanted to try 
> using a Marker, but cannot see how to do that.  The problem exists 
> regardless of running 1 or 55 pipelines concurrently.
> 
> 
> 
>        File fout = new File(documentPath);
>        fos = new FileOutputStream(fout);
>        org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
>        fos.flush();
>        fos.close();
>        logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);
> 
> 
> 
> Suggestions will be appreciated.
> 
> 
> 
> Thanks / Dan
> 
> 
> 

