From: "D. Heinze" <dheinze@gnoetics.com>
To: user@uima.apache.org
Subject: RE: CAS serializationWithCompression
Date: Tue, 12 Jan 2016 07:54:06 -0800
Organization: Gnoetics, Inc.

The CAS.size() starts out larger than the serializeWithCompression output, but eventually the serialized output grows to be larger than CAS.size(). The overall process is:

* Create a new CAS.
* Read in an XML document and store the structure and content in the CAS.
* Tokenize and parse the document and store that info in the CAS.
* Run a number of lexical engines and ConceptMapper engines on the data and store that info in the CAS.
* Produce an XML document with the content of the original input document marked up with the analysis results, and both write that out to a file and store it in the CAS.
* serializeWithCompression to a FileOutputStream.
* cas.reset()
* Iterate on the next input document.

All the work other than creating the CAS and cas.reset() is done using the JCas. Even though the output CASes keep getting larger, they seem to deserialize just fine and are usable.

Thanks/Dan

-----Original Message-----
From: Richard Eckart de Castilho [mailto:rec@apache.org]
Sent: Tuesday, January 12, 2016 2:45 AM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

Is the CAS.size() larger than the serialized version or smaller?

What are you actually doing to the CAS? Just serializing/deserializing a couple of times in a row, or do you actually add feature structures?

The sample code you show doesn't give any hint about where the CAS comes from and what is being done with it.

-- 
Richard

> On 12.01.2016, at 03:06, D. Heinze wrote:
> 
> I'm having a problem with CAS serializationWithCompression. I am
> processing a few million text documents on an IBM P8 with 16 physical
> SMT-8 CPUs, 200 GB RAM, Ubuntu 14.04.3 LTS, and IBM Java 1.8.
> 
> I run 55 UIMA pipelines concurrently. I'm using UIMA 2.6.0.
> 
> I use serializeWithCompression to save the final state of the
> processing on each document to a file for later processing.
> 
> However, the size of the serialized CAS just keeps growing. The size
> of the CAS is stable, but the serialized CASes just keep getting
> bigger. I even went to creating a new CAS for each process instead of
> using cas.reset(). I have also tried writing the serialized CAS to a
> byte array output stream first and then to a file, but it is
> serializeWithCompression that causes the size problem, not writing the file.
> 
> Here's what the code looks like. Flushing or not flushing does not
> make a difference. Closing or not closing the file output stream does
> not make a difference (other than leaking memory). I've also tried
> doing serializeWithCompression with type filtering. Wanted to try
> using a Marker, but cannot see how to do that. The problem exists
> regardless of doing 1 or 55 pipelines concurrently.
> 
> File fout = new File(documentPath);
> fos = new FileOutputStream(fout);
> org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
> fos.flush();
> fos.close();
> logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);
> 
> Suggestions will be appreciated.
> 
> Thanks / Dan
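
For the Marker question above, here is a minimal sketch of what per-document serialization could look like, assuming UIMA 2.6.x, where org.apache.uima.cas.impl.Serialization offers both serializeWithCompression(CAS, Object) and a serializeWithCompression(CAS, Object, Marker) overload, and where CAS.createMarker() is available. The class and method names (CasSnapshotWriter, writeFullSnapshot, writeDelta) and the documentPath argument are placeholders for illustration only.

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Marker;
import org.apache.uima.cas.impl.Serialization;

public class CasSnapshotWriter {

    // Write a full compressed (binary form 4) snapshot of the CAS to a file,
    // then reset the CAS for the next document.
    static void writeFullSnapshot(CAS cas, String documentPath) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(documentPath))) {
            Serialization.serializeWithCompression(cas, out);
        }
        cas.reset();  // also invalidates any Marker created on this CAS
    }

    // Delta variant: remember the current high-water mark, add analysis results,
    // then serialize only the feature structures created after the marker.
    static void writeDelta(CAS cas, String documentPath) throws IOException {
        Marker marker = cas.createMarker();
        // ... run the analysis engines that add new feature structures here ...
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(documentPath))) {
            Serialization.serializeWithCompression(cas, out, marker);
        }
    }
}

Note that a delta serialized against a Marker can generally only be deserialized into a CAS that already contains the pre-marker content, so the full-snapshot form is likely the right fit when each output file has to stand on its own.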