uima-user mailing list archives

From "D. Heinze" <dhei...@gnoetics.com>
Subject RE: CAS serializationWithCompression
Date Wed, 13 Jan 2016 22:35:42 GMT
Yes.  That was the main reason I wanted to update from 2.6.0.  Once I could
examine the JSON CAS, it took about half an hour to track down the problem.
If I had had to hunt blind, it would have taken forever.  I had already
profiled all the code for real and potential memory leaks, but this was one
that didn't show up.

Thanks / Dan

-----Original Message-----
From: Marshall Schor [mailto:msa@schor.com] 
Sent: Wednesday, January 13, 2016 2:27 PM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

Great!  Glad to see some use is being made of JSON :-). 


On 1/13/2016 2:05 PM, D. Heinze wrote:
> Found the problem by serializing the CAS to JSON.  The CAS sofaText
> was acting like a pushdown stack and accumulating the full text of 
> each successive document due to an input stream and buffer not getting 
> properly closed/cleared between iterations.
> Thanks / Dan
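The failure mode described in the message above (text from earlier documents carried into each new one because a buffer is not cleared between iterations) can be reproduced without UIMA.  The following is only a minimal sketch of that kind of bug, not the actual pipeline code: the class DocumentReader, its clearBetweenDocs flag, and the helper textLengths are hypothetical names made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SofaTextAccumulation {

    /** Hypothetical reader that reuses a single buffer for every document. */
    public static class DocumentReader {
        private final StringBuilder buffer = new StringBuilder();
        private final boolean clearBetweenDocs;

        public DocumentReader(boolean clearBetweenDocs) {
            this.clearBetweenDocs = clearBetweenDocs;
        }

        /** Returns the text that would end up as the document text. */
        public String read(String rawDocument) {
            if (clearBetweenDocs) {
                buffer.setLength(0); // the fix: drop the previous document's text
            }
            // try-with-resources closes the reader after every document
            try (BufferedReader in = new BufferedReader(new StringReader(rawDocument))) {
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.append(line).append('\n');
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return buffer.toString();
        }
    }

    /** Length of the text produced for each document, in processing order. */
    public static List<Integer> textLengths(boolean clearBetweenDocs, String... docs) {
        DocumentReader reader = new DocumentReader(clearBetweenDocs);
        List<Integer> lengths = new ArrayList<>();
        for (String doc : docs) {
            lengths.add(reader.read(doc).length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        String[] docs = {"first doc", "second doc", "third doc"};
        System.out.println("buffer reused:  " + textLengths(false, docs));
        System.out.println("buffer cleared: " + textLengths(true, docs));
    }
}
```

With the buffer reused, each reported length includes all earlier documents, which matches the "pushdown stack" behavior described above.  With the buffer cleared, each length reflects only the current document.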
> -----Original Message-----
> From: D. Heinze [mailto:dheinze@gnoetics.com]
> Sent: Tuesday, January 12, 2016 2:13 PM
> To: user@uima.apache.org
> Subject: RE: CAS serializationWithCompression
> Thanks Marshall.  Will do.  I just completed upgrading from UIMA 2.6.0
> to 2.8.1 just to make sure there were no issues there.  Will now get
> back to the CAS serialization issue.  Yes, I've been trying to think of
> where there could be retained junk that is getting added back into the
> CAS with each iteration.
> -Dan
> -----Original Message-----
> From: Marshall Schor [mailto:msa@schor.com]
> Sent: Tuesday, January 12, 2016 11:56 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
> Hmmm, seems like unusual behavior.
> It would help a lot to diagnose this if you could construct a small
> test case - one which perhaps creates a CAS, fills it with a bit of
> data, does the compressed serialization, resets the CAS, and loops, to
> see if that produces "expanding" serializations.
>   -- if it does, please post the test case to a Jira and we'll
> diagnose / fix this :-)
>   -- if it doesn't, then you have to get closer to your actual use
> case and iterate until you see what it is that you last added that
> starts making it serialize ever-expanding instances.  That will be a
> big clue, I think.
> -Marshall
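The loop suggested in the message above (fill, serialize, reset, repeat, and watch whether the serialized size keeps expanding) can be sketched as a small harness.  This is an illustration under stated assumptions, not a UIMA test case: IterationWriter, stableWriter, and leakyWriter are hypothetical names, and the two writers are plain-Java stand-ins that emit raw bytes.  In a real test the writer body would fill the CAS, call Serialization.serializeWithCompression(cas, out) as in the original post, and then call cas.reset().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SerializedSizeCheck {

    /** Hypothetical hook: fill, serialize to out, and reset for one iteration. */
    public interface IterationWriter {
        void write(int iteration, OutputStream out) throws IOException;
    }

    /** Runs the loop and records how many bytes each iteration actually wrote. */
    public static List<Integer> measure(int iterations, IterationWriter writer) {
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try {
                writer.write(i, bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            sizes.add(bytes.size()); // size of the serialized form, not of the in-memory state
        }
        return sizes;
    }

    /** True when every recorded size is larger than the one before it. */
    public static boolean strictlyGrowing(List<Integer> sizes) {
        for (int i = 1; i < sizes.size(); i++) {
            if (sizes.get(i) <= sizes.get(i - 1)) {
                return false;
            }
        }
        return sizes.size() > 1;
    }

    /** Stand-in for a well-behaved pipeline: nothing survives between iterations. */
    public static IterationWriter stableWriter() {
        return (i, out) -> out.write(("doc-" + i + "\n").getBytes(StandardCharsets.UTF_8));
    }

    /** Stand-in for a leaking pipeline: earlier content is retained and re-emitted. */
    public static IterationWriter leakyWriter() {
        StringBuilder retained = new StringBuilder();
        return (i, out) -> {
            retained.append("doc-").append(i).append('\n');
            out.write(retained.toString().getBytes(StandardCharsets.UTF_8));
        };
    }

    public static void main(String[] args) {
        System.out.println("stable: " + measure(4, stableWriter()));
        System.out.println("leaky:  " + measure(4, leakyWriter()));
    }
}
```

Measuring the bytes written, rather than logging cas.size(), makes the growth visible per iteration.  If the sizes stay flat with a minimal writer, the next step is to add pipeline stages one at a time until strictlyGrowing starts returning true.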
> On 1/12/2016 10:54 AM, D. Heinze wrote:
>> The CAS.size() starts as larger than the serializedWithCompression 
>> version, but eventually the serializedWithCompression version grows 
>> to be larger than the CAS.size().
>> The overall process is:
>> * Create a new CAS
>> * Read in an xml document and store the structure and content in the cas.
>> * Tokenize and parse the document and store that info in the cas.
>> * Run a number of lexical engines and ConceptMapper engines on the 
>> data and store that info in the cas
>> * Produce an xml document with the content of the original input 
>> document marked up with the analysis results and both write that out 
>> to a file and also store it in the cas
>> * serializeWithCompression to a FileOutputStream
>> * cas.reset()
>> * iterate on the next input document
>> All the work other than creating and cas.reset() is done using the JCas.
>> Even though the output CASes keep getting larger, they seem to 
>> deserialize just fine and are usable.
>> Thanks/Dan
>> -----Original Message-----
>> From: Richard Eckart de Castilho [mailto:rec@apache.org]
>> Sent: Tuesday, January 12, 2016 2:45 AM
>> To: user@uima.apache.org
>> Subject: Re: CAS serializationWithCompression
>> Is the CAS.size() larger than the serialized version or smaller?
>> What are you actually doing to the CAS? Just
>> serializing/deserializing a couple of times in a row, or do you
>> actually add feature structures?
>> The sample code you show doesn't give any hint about where the CAS 
>> comes from and what is being done with it.
>> -- Richard
>>> On 12.01.2016, at 03:06, D. Heinze <dheinze@gnoetics.com> wrote:
>>> I'm having a problem with CAS serializationWithCompression.  I am
>>> processing a few million text documents on an IBM P8 with 16 physical
>>> SMT8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>> I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.
>>> I use serializeWithCompression to save the final state of the 
>>> processing on each document to a file for later processing.
>>> However, the size of the serialized CAS just keeps growing.  The
>>> size of the CAS is stable, but the serialized CASes just keep
>>> getting bigger. I even went to creating a new CAS for each process
>>> instead of using cas.reset().  I have also tried writing the
>>> serialized CAS to a byte array output stream first and then to a
>>> file, but it is the serializeWithCompression call that causes the
>>> size problem, not writing the file.
>>> Here's what the code looks like.  Flushing or not flushing does not
>>> make a difference.  Closing or not closing the file output stream
>>> does not make a difference (other than leaking memory).  I've also
>>> tried doing serializeWithCompression with type filtering.  Wanted to
>>> try using a Marker, but cannot see how to do that.  The problem
>>> exists regardless of doing 1 or 55 pipelines concurrently.
>>>        File fout = new File(documentPath);
>>>        fos = new FileOutputStream(fout);
>>>        org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
>>>        fos.flush();
>>>        fos.close();
>>>        logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);
>>> Suggestions will be appreciated.
>>> Thanks / Dan
