From: "D. Heinze"
To: user@uima.apache.org
Subject: RE: CAS serializationWithCompression
Date: Tue, 12 Jan 2016 14:13:12 -0800
Organization: Gnoetics, Inc.
Message-ID: <0b2e01d14d86$6d7e3880$487aa980$@gnoetics.com>
In-Reply-To: <56955A38.8050102@schor.com>

Thanks Marshall. Will do.
I just completed upgrading from UIMA 2.6.0 to 2.8.1 just to make sure there were no issues there. Will now get back to the CAS serialization issue. Yes, I've been trying to think of where there could be retained junk that is getting added back into the CAS with each iteration.

-Dan

-----Original Message-----
From: Marshall Schor [mailto:msa@schor.com]
Sent: Tuesday, January 12, 2016 11:56 AM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

hmmm, seems like unusual behavior.

It would help a lot to diagnose this if you could construct a small test case - one which perhaps creates a cas, fills it with a bit of data, does the compressed serialization, resets the cas, and loops and see if that produces "expanding" serializations.

-- if it does, please post the test case to a Jira and we'll diagnose / fix this :-)

-- if it doesn't, then you have to get closer to your actual use case and iterate until you see what it is that you last added that starts making it serialize ever-expanding instances. That will be a big clue, I think.

-Marshall

On 1/12/2016 10:54 AM, D. Heinze wrote:
> The CAS.size() starts as larger than the serializedWithCompression
> version, but eventually the serializedWithCompression version grows to
> be larger than the CAS.size().
> The overall process is:
> * Create a new CAS
> * Read in an xml document and store the structure and content in the cas.
> * Tokenize and parse the document and store that info in the cas.
> * Run a number of lexical engines and ConceptMapper engines on the
>   data and store that info in the cas
> * Produce an xml document with the content of the original input
>   document marked up with the analysis results and both write that out
>   to a file and also store it in the cas
> * serializeWithCompression to a FileOutputStream
> * cas.reset()
> * iterate on the next input document
> All the work other than creating and cas.reset() is done using the JCas.
> Even though the output CASes keep getting larger, they seem to
> deserialize just fine and are usable.
> Thanks/Dan
>
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:rec@apache.org]
> Sent: Tuesday, January 12, 2016 2:45 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
>
> Is the CAS.size() larger than the serialized version or smaller?
> What are you actually doing to the CAS? Just serializing/deserializing
> a couple of times in a row, or do you actually add feature structures?
> The sample code you show doesn't give any hint about where the CAS
> comes from and what is being done with it.
>
> -- Richard
>
>> On 12.01.2016, at 03:06, D. Heinze wrote:
>>
>> I'm having a problem with CAS serializationWithCompression. I am
>> processing a few million text documents on an IBM P8 with 16 physical
>> SMT8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>
>> I run 55 UIMA pipelines concurrently. I'm using UIMA 2.6.0.
>>
>> I use serializeWithCompression to save the final state of the
>> processing on each document to a file for later processing.
>>
>> However, the size of the serialized CAS just keeps growing. The size
>> of the CAS is stable, but the serialized CASes just keep getting
>> bigger. I even went to creating a new CAS for each process instead of
>> using cas.reset(). I have also tried writing the serialized CAS to a
>> byte array output stream first and then to a file, but it is the
>> serializeWithCompression that caused the size problem, not writing
>> the file.
>>
>> Here's what the code looks like. Flushing or not flushing does not
>> make a difference. Closing or not closing the file output stream does
>> not make a difference (other than leaking memory). I've also tried
>> doing serializeWithCompression with type filtering. Wanted to try
>> using a Marker, but cannot see how to do that. The problem exists
>> regardless of doing 1 or 55 pipelines concurrently.
>>
>>     File fout = new File(documentPath);
>>     fos = new FileOutputStream(fout);
>>     org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
>>     fos.flush();
>>     fos.close();
>>     logger.info( "serializedCas size " + cas.size() + " ToFile " + documentPath);
>>
>> Suggestions will be appreciated.
>>
>> Thanks / Dan
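
For anyone picking up this thread: the loop Marshall describes (fill, serialize, reset, repeat, then compare sizes) can be sketched roughly as below. This is only a harness under stated assumptions, not code from the thread. The class `SerializedSizeCheck` and the `Iteration` interface are hypothetical names, and the two lambdas in `main` are plain byte-array stand-ins so the sketch runs without UIMA on the classpath. In a real test the lambda body would presumably fill the CAS, call `org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, out)` and then `cas.reset()`. Note that it records the bytes actually written, whereas the quoted snippet logs `cas.size()`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

public class SerializedSizeCheck {

    /** Hypothetical hook: one fill / serialize-to-out / reset cycle. */
    @FunctionalInterface
    public interface Iteration {
        void run(int index, OutputStream out) throws IOException;
    }

    /** Runs the cycle n times and returns the number of bytes written each time. */
    public static List<Integer> measure(int n, Iteration iteration) throws IOException {
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Fresh buffer per iteration, so the size reflects this cycle only.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            iteration.run(i, out);
            sizes.add(out.size());
        }
        return sizes;
    }

    /** True if every measurement is strictly larger than the one before it. */
    public static boolean growsMonotonically(List<Integer> sizes) {
        if (sizes.size() < 2) {
            return false;
        }
        for (int i = 1; i < sizes.size(); i++) {
            if (sizes.get(i) <= sizes.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[1024];

        // Stand-in for a well-behaved cycle: the same amount of data every time.
        List<Integer> stable = measure(5, (i, out) -> out.write(payload));

        // Stand-in for retained state: data from earlier cycles is written again.
        ByteArrayOutputStream retained = new ByteArrayOutputStream();
        List<Integer> leaking = measure(5, (i, out) -> {
            retained.write(payload);
            retained.writeTo(out);
        });

        System.out.println("stable:  " + stable + " grows=" + growsMonotonically(stable));
        System.out.println("leaking: " + leaking + " grows=" + growsMonotonically(leaking));
    }
}
```

If the sizes stay flat with a trivial CAS, the next step would be to add pieces of the real pipeline to the lambda one at a time until the growth appears, as suggested above.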