From: "D. Heinze"
Delivered-To: mailing list user@uima.apache.org
Subject: RE: CAS serializationWithCompression
Date: Wed, 13 Jan 2016 14:35:42 -0800
Organization: Gnoetics, Inc.
Message-ID: <0c9001d14e52$bc923bc0$35b6b340$@gnoetics.com>

Yes. That was the main reason I wanted to update from 2.6.0. Being able to
examine the Json CAS, it took about half an hour to track down the problem.
If I had to hunt blind, it would have taken forever. I had already profiled
all the code for real and potential memory leaks, but this was one that
didn't show up.

Thanks / Dan

-----Original Message-----
From: Marshall Schor [mailto:msa@schor.com]
Sent: Wednesday, January 13, 2016 2:27 PM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

Great! Glad to see some use is being made of JSON :-).

-Marshall

On 1/13/2016 2:05 PM, D. Heinze wrote:
> Found the problem by serializing the CAS to Json. The CAS sofaText
> was acting like a pushdown stack and accumulating the full text of
> each successive document due to an input stream and buffer not getting
> properly closed/cleared between iterations.
>
> Thanks / Dan
>
> -----Original Message-----
> From: D. Heinze [mailto:dheinze@gnoetics.com]
> Sent: Tuesday, January 12, 2016 2:13 PM
> To: user@uima.apache.org
> Subject: RE: CAS serializationWithCompression
>
> Thanks Marshall. Will do. I just completed upgrading from UIMA 2.6.0 to
> 2.8.1 just to make sure there were no issues there.
> Will now get back to the CAS serialization issue. Yes, I've been trying
> to think of where there could be retained junk that is getting added
> back into the CAS with each iteration.
>
> -Dan
>
> -----Original Message-----
> From: Marshall Schor [mailto:msa@schor.com]
> Sent: Tuesday, January 12, 2016 11:56 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
>
> hmmm, seems like unusual behavior.
>
> It would help a lot to diagnose this if you could construct a small
> test case - one which perhaps creates a cas, fills it with a bit of
> data, does the compressed serialization, resets the cas, and loops and
> see if that produces "expanding" serializations.
>
>   -- if it does, please post the test case to a Jira and we'll
>      diagnose / fix this :-)
>
>   -- if it doesn't, then you have to get closer to your actual use
>      case and iterate until you see what it is that you last added that
>      starts making it serialize ever-expanding instances. That will be
>      a big clue, I think.
>
> -Marshall
>
> On 1/12/2016 10:54 AM, D. Heinze wrote:
>> The CAS.size() starts as larger than the serializedWithCompression
>> version, but eventually the serializedWithCompression version grows
>> to be larger than the CAS.size().
>> The overall process is:
>> * Create a new CAS
>> * Read in an xml document and store the structure and content in the cas.
>> * Tokenize and parse the document and store that info in the cas.
>> * Run a number of lexical engines and ConceptMapper engines on the
>>   data and store that info in the cas
>> * Produce an xml document with the content of the original input
>>   document marked up with the analysis results and both write that out
>>   to a file and also store it in the cas
>> * serializeWithCompression to a FileOutputStream
>> * cas.reset()
>> * iterate on the next input document
>> All the work other than creating and cas.reset() is done using the JCas.
>> Even though the output CASes keep getting larger, they seem to
>> deserialize just fine and are usable.
>> Thanks/Dan
>>
>> -----Original Message-----
>> From: Richard Eckart de Castilho [mailto:rec@apache.org]
>> Sent: Tuesday, January 12, 2016 2:45 AM
>> To: user@uima.apache.org
>> Subject: Re: CAS serializationWithCompression
>>
>> Is the CAS.size() larger than the serialized version or smaller?
>> What are you actually doing to the CAS? Just serializing/deserializing
>> a couple of times in a row, or do you actually add feature structures?
>> The sample code you show doesn't give any hint about where the CAS
>> comes from and what is being done with it.
>>
>> -- Richard
>>
>>> On 12.01.2016, at 03:06, D. Heinze wrote:
>>>
>>> I'm having a problem with CAS serializationWithCompression. I am
>>> processing a few million text documents on an IBM P8 with 16 physical
>>> SMT8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>>
>>> I run 55 UIMA pipelines concurrently. I'm using UIMA 2.6.0.
>>>
>>> I use serializeWithCompression to save the final state of the
>>> processing on each document to a file for later processing.
>>>
>>> However, the size of the serialized CAS just keeps growing. The
>>> size of the CAS is stable, but the serialized CASes just keep
>>> getting bigger. I even went to creating a new CAS for each process
>>> instead of using cas.reset(). I have also tried writing the
>>> serialized CAS to a byte array output stream first and then to a
>>> file, but it is the serializeWithCompression that caused the size
>>> problem, not writing the file.
>>>
>>> Here's what the code looks like. Flushing or not flushing does not
>>> make a difference. Closing or not closing the file output stream
>>> does not make a difference (other than leaking memory). I've also
>>> tried doing serializeWithCompression with type filtering. Wanted to
>>> try using a Marker, but cannot see how to do that.
>>> The problem exists regardless of doing 1 or 55 pipelines concurrently.
>>>
>>> File fout = new File(documentPath);
>>> fos = new FileOutputStream(fout);
>>> org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
>>> fos.flush();
>>> fos.close();
>>> logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);
>>>
>>> Suggestions will be appreciated.
>>>
>>> Thanks / Dan
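
The root cause reported at the top of this thread (a buffer that was not
cleared between documents, so each document's text piled up on top of the
earlier ones) can be illustrated without UIMA. What follows is only a rough
sketch under stated assumptions, not the code from the actual pipeline: the
class StaleBufferDemo, the methods run and compressedSize, and the sample
documents are all hypothetical, and plain gzip stands in for
serializeWithCompression.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; none of these names come from UIMA or the thread.
class StaleBufferDemo {

    // Stand-in for a compressed serialization: gzip the text, return the byte count.
    static int compressedSize(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Feeds each document through one shared buffer and records the
    // compressed size of what would be written for that document.
    // When clearBetweenDocs is false, the buffer still holds all earlier
    // documents, so every output contains the text seen so far.
    static List<Integer> run(List<String> docs, boolean clearBetweenDocs) {
        StringBuilder buffer = new StringBuilder();
        List<Integer> sizes = new ArrayList<>();
        for (String doc : docs) {
            if (clearBetweenDocs) {
                buffer.setLength(0); // the missing step: drop the previous document
            }
            buffer.append(doc);
            sizes.add(compressedSize(buffer.toString()));
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "alpha report on cardiology",
                "beta note about radiology",
                "gamma summary for oncology");
        System.out.println("buffer never cleared: " + run(docs, false));
        System.out.println("buffer cleared:       " + run(docs, true));
    }
}
```

With the buffer left uncleared, the reported sizes keep climbing from one
document to the next, which matches the symptom described in the thread;
with the buffer cleared, each size depends only on the current document.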