uima-user mailing list archives

From "D. Heinze" <dhei...@gnoetics.com>
Subject RE: CAS serializationWithCompression
Date Wed, 13 Jan 2016 19:05:59 GMT
Found the problem by serializing the CAS to JSON.  The CAS sofaText was
acting like a pushdown stack, accumulating the full text of each
successive document, because an input stream and buffer were not being
properly closed/cleared between iterations.
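
For anyone hitting the same symptom, here is a minimal sketch of this kind
of bug. It is not the actual pipeline code: the class and method names are
hypothetical, and it assumes the text returned by the reader is what would
later be handed to the CAS as the document text. The buggy variant reuses
one buffer without clearing it, so each document's text is appended to
everything read before it; the fixed variant uses a fresh buffer and closes
the stream on every call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; none of these names come from the original pipeline.
class SofaTextAccumulation {

    // Buggy: a single buffer is shared across documents and never cleared.
    private final StringBuilder shared = new StringBuilder();

    String readBuggy(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            int c;
            while ((c = in.read()) != -1) {
                shared.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Returns the text of this document plus all earlier documents.
        return shared.toString();
    }

    // Fixed: a fresh buffer per document; try-with-resources closes the stream.
    static String readFixed(Reader source) {
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            int c;
            while ((c = in.read()) != -1) {
                buffer.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        SofaTextAccumulation reader = new SofaTextAccumulation();
        String[] docs = {"first doc", "second doc", "third doc"};
        for (String doc : docs) {
            int buggy = reader.readBuggy(new StringReader(doc)).length();
            int fixed = readFixed(new StringReader(doc)).length();
            System.out.println("buggy length=" + buggy + ", fixed length=" + fixed);
        }
    }
}
```

With the buggy reader the reported length grows on every iteration even
though each input is about the same size, which matches the ever-larger
serialized output described in this thread.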

Thanks / Dan

-----Original Message-----
From: D. Heinze [mailto:dheinze@gnoetics.com] 
Sent: Tuesday, January 12, 2016 2:13 PM
To: user@uima.apache.org
Subject: RE: CAS serializationWithCompression

Thanks Marshall.  Will do.  I just completed upgrading from UIMA 2.6.0 to
2.8.1 just to make sure there were no issues there.  Will now get back to
the CAS serialization issue.  Yes, I've been trying to think of where there
could be retained junk that is getting added back into the CAS with each
iteration.

-Dan

-----Original Message-----
From: Marshall Schor [mailto:msa@schor.com]
Sent: Tuesday, January 12, 2016 11:56 AM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

hmmm, seems like unusual behavior.

It would help a lot to diagnose this if you could construct a small test
case - one which perhaps creates a cas, fills it with a bit of data, does
the compressed serialization, resets the cas, and loops, to see if that
produces "expanding" serializations.

  -- if it does, please post the test case to a Jira and we'll diagnose /
fix this :-)

  -- if it doesn't, then you have to get closer to your actual use case and
iterate until you see what it is that you last added that starts making it
serialize ever-expanding instances.  That will be a big clue, I think.
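
The loop described above could be organized roughly as follows. This is
only a sketch of the measuring harness, with hypothetical names, and it
deliberately leaves out the UIMA calls so that it runs on its own: in a
real test the producer would fill the CAS, call
Serialization.serializeWithCompression(cas, out) on a
ByteArrayOutputStream, call cas.reset(), and return the bytes. The
stand-in producers in main only simulate a stable and a leaking case.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical harness; the producer stands in for "fill CAS, serialize, reset".
class SerializationGrowthCheck {

    // Runs the producer once per iteration and records each serialized size.
    static List<Integer> measure(int iterations, IntFunction<byte[]> producer) {
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            sizes.add(producer.apply(i).length);
        }
        return sizes;
    }

    // True only if every size is strictly larger than the one before it.
    static boolean keepsGrowing(List<Integer> sizes) {
        if (sizes.size() < 2) {
            return false;
        }
        for (int i = 1; i < sizes.size(); i++) {
            if (sizes.get(i) <= sizes.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sameDoc = "same document text".getBytes(StandardCharsets.UTF_8);

        // Stand-in for a healthy loop: identical input gives identical size.
        List<Integer> stable = measure(5, i -> sameDoc);

        // Stand-in for a leak: state retained between iterations keeps growing.
        StringBuilder retained = new StringBuilder();
        List<Integer> leaky = measure(5,
                i -> retained.append("same document text").toString()
                        .getBytes(StandardCharsets.UTF_8));

        System.out.println("stable sizes " + stable + " growing=" + keepsGrowing(stable));
        System.out.println("leaky sizes " + leaky + " growing=" + keepsGrowing(leaky));
    }
}
```

Feeding the same input on every iteration is the point of the design: if
the sizes still grow, something is being carried over between iterations
rather than coming from the documents themselves.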

-Marshall

On 1/12/2016 10:54 AM, D. Heinze wrote:
> The CAS.size() starts as larger than the serializedWithCompression 
> version, but eventually the serializedWithCompression version grows to 
> be larger than the CAS.size().
> The overall process is:
> * Create a new CAS
> * Read in an xml document and store the structure and content in the cas.
> * Tokenize and parse the document and store that info in the cas.
> * Run a number of lexical engines and ConceptMapper engines on the 
> data and store that info in the cas
> * Produce an xml document with the content of the original input 
> document marked up with the analysis results and both write that out 
> to a file and also store it in the cas
> * serializeWithCompression to a FileOutputStream
> * cas.reset()
> * iterate on the next input document
> All the work other than creating and cas.reset() is done using the JCas.
> Even though the output CASes keep getting larger, they seem to 
> deserialize just fine and are usable.
> Thanks/Dan
>
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:rec@apache.org]
> Sent: Tuesday, January 12, 2016 2:45 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
>
> Is the CAS.size() larger than the serialized version or smaller?
> What are you actually doing to the CAS? Just serializing/deserializing 
> a couple of times in a row, or do you actually add feature structures?
> The sample code you show doesn't give any hint about where the CAS 
> comes from and what is being done with it.
>
> -- Richard
>
>> On 12.01.2016, at 03:06, D. Heinze <dheinze@gnoetics.com> wrote:
>>
>> I'm having a problem with CAS serializationWithCompression.  I am 
>> processing a few million text documents on an IBM P8 with 16 physical 
>> SMT8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>
>> I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.
>>
>> I use serializeWithCompression to save the final state of the 
>> processing on each document to a file for later processing.
>>
>> However, the size of the serialized CAS just keeps growing.  The size 
>> of the CAS is stable, but the serialized CASes just keep getting 
>> bigger. I even went to creating a new CAS for each process instead of 
>> using cas.reset().  I have also tried writing the serialized CAS to a 
>> byte array output stream first and then to a file, but it is the 
>> serializeWithCompression that caused the size problem, not writing the
>> file.
>> Here's what the code looks like.  Flushing or not flushing does not 
>> make a difference.  Closing or not closing the file output stream does 
>> not make a difference (other than leaking memory).  I've also tried 
>> doing serializeWithCompression with type filtering.  Wanted to try 
>> using a Marker, but cannot see how to do that.  The problem exists 
>> regardless of doing 1 or 55 pipelines concurrently.
>>
>>        File fout = new File(documentPath);
>>        fos = new FileOutputStream(fout);
>>        org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
>>        fos.flush();
>>        fos.close();
>>        logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);
>>
>> Suggestions will be appreciated.
>>
>> Thanks / Dan

