From: Jason Sachs
To: user@arrow.apache.org
Date: Tue, 03 Nov 2020 21:26:34 -0000
Subject: Re: Best way to store ragged packet data in Parquet files

On 2020/11/03 20:49:46, Micah Kornfield wrote:
> Hi Jason,
> At least as a first pass I would try to avoid the padding and storing the
> length separately in Parquet. Using one column for timestamp and one
> column of bytes for the data is what I would try first. If there is any
> structure to the packets splitting them into the structure could also help.
>
> -Micah

For the test cases I have, >99% of the packets are the same length, so there's little to no benefit to removing the padding; the length field and zero padding add barely anything once you factor compression into the mix. I've tried use_dictionary=False and that does help some. But I'll post an updated example to back these statements up and see how much better I can get. I'm just surprised that HDF5 does a better job in this case; maybe I don't understand the constraints the file format imposes on data compression.