From: Jason Sachs
To: user@arrow.apache.org
Date: Tue, 03 Nov 2020 21:26:34 -0000
Subject: Re: Best way to store ragged packet data in Parquet files

On 2020/11/03 20:49:46, Micah Kornfield wrote:
> Hi Jason,
> At least as a first pass I would try to avoid the padding and storing the
> length separately in Parquet. Using one column for timestamp and one
> column of bytes for the data is what I would try first. If there is any
> structure to the packets splitting them into the structure could also help.
>
> -Micah

For the test cases I have, >99% of the packets are the same length, so there's little to no benefit to removing the padding; the length field and zero padding add barely anything once you factor compression into the mix. I've tried use_dictionary=False and that does help some. But I'll post an updated example to back these statements up and see how much better I can get. I'm just surprised that HDF5 does a better job in this case; maybe I don't understand the constraints the file format imposes on data compression.