From: "D. Heinze"
To: user@uima.apache.org
Subject: CAS serializationWithCompression
Date: Mon, 11 Jan 2016 18:06:22 -0800

I'm having a problem with CAS serializationWithCompression. I am processing a few million text documents on an IBM P8 with 16 physical SMT8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
I run 55 UIMA pipelines concurrently, using UIMA 2.6.0. I use serializeWithCompression to save the final state of the processing on each document to a file for later processing. However, the size of the serialized CAS just keeps growing. The size of the CAS itself is stable, but the serialized CASes keep getting bigger.

I even went to creating a new CAS for each process instead of using cas.reset(). I have also tried writing the serialized CAS to a byte array output stream first and then to a file, but it is serializeWithCompression that causes the size problem, not writing the file.

Flushing or not flushing does not make a difference. Closing or not closing the file output stream does not make a difference (other than leaking memory). I've also tried doing serializeWithCompression with type filtering. I wanted to try using a Marker, but cannot see how to do that. The problem exists regardless of whether I run 1 or 55 pipelines concurrently.

Here's what the code looks like:

    File fout = new File(documentPath);
    fos = new FileOutputStream(fout);
    org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
    fos.flush();
    fos.close();
    logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);

Suggestions will be appreciated.

Thanks / Dan
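For anyone trying to reproduce this: the log line in the snippet above reports cas.size(), which is the in-memory CAS size rather than the number of bytes actually written. A minimal sketch of the byte-array approach mentioned above, which records the real serialized size per document, might look like the following. The class name SerializedSizeCheck and the StreamSerializer interface are hypothetical stand-ins so the example runs without UIMA on the classpath; in the real pipeline the lambda would presumably wrap the serializeWithCompression(cas, out) call from the snippet.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SerializedSizeCheck {

    /** Hypothetical stand-in for whatever call writes the serialized CAS to a stream. */
    @FunctionalInterface
    public interface StreamSerializer {
        void writeTo(OutputStream out) throws IOException;
    }

    /**
     * Serializes into memory first, writes the bytes to the target file,
     * and returns the number of serialized bytes so it can be logged per document.
     */
    public static long writeAndMeasure(Path target, StreamSerializer serializer) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serializer.writeTo(buffer);          // assumption: the serializer writes everything before returning
        byte[] bytes = buffer.toByteArray();
        Files.write(target, bytes);          // creates or truncates the file, then closes it
        return bytes.length;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("cas-", ".bin");
        try {
            // Dummy payload in place of a real CAS; only the measuring logic is demonstrated.
            long serialized = writeAndMeasure(tmp, out -> out.write(new byte[] {1, 2, 3}));
            System.out.println("serialized bytes " + serialized + " file bytes " + Files.size(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Logging the returned byte count next to cas.size() for each document would show whether the serialized output grows while the CAS itself stays the same size.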