arrow-user mailing list archives

From Jason Sachs <jmsa...@gmail.com>
Subject Re: Compressing parquet metadata?
Date Wed, 04 Nov 2020 19:39:52 GMT
Yes. Ouch, so there's a 4/3 size hit there for base64. (Is that always the case, or does it
use plaintext if possible?)
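
For reference, the 4/3 figure comes from base64 mapping every 3 input bytes to 4 output
characters, padded to a multiple of 4. A quick check with Python's standard base64 module
(this only illustrates the encoding itself; it says nothing about whether Parquet always
base64-encodes metadata values, and the sample payload is made up):

```python
import base64
import math


def base64_size(n_bytes: int) -> int:
    """Length of standard padded base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


# Made-up sample: 102,400 bytes of arbitrary binary data.
payload = bytes(range(256)) * 400
encoded = base64.b64encode(payload)

# The ratio approaches 4/3 as the payload grows.
print(len(payload), len(encoded), len(encoded) / len(payload))
```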

I'm trying to figure out what kind of request to file in the issue tracker to help support
my use case. (data logging)

I have enough stuff I want to put in metadata that the use of compression matters to me. (for
one file it's not so bad, but we generate so many data log files with different metadata that
in aggregate it does matter.) An alternative that I might need to pursue is having a .zip
file containing one or more parquet files along with some metadata files; then the barrier
to using compression is fairly low.... but I'd like to avoid the complexity overhead of that.

If it does make sense to keep any compression as a manual feature, would it be reasonable
to ask for the compression mechanism of Parquet as a user-exposed feature? It is a fairly
nice interface (at least from the Python bindings) where, as a user, all I care about on the
compression side is specifying the compression method and the compression level, and the Parquet
library takes care of using the correct algorithm; then on the decompression side it does
everything based on what it stored in the file. In other words:

    binary COMPRESSED_BLOB = compress(binary BLOB, string compression, int compression_level)
    binary BLOB = uncompress(binary COMPRESSED_BLOB)

I can't seem to find an equivalent in Python to do this for standalone usage.
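
A minimal sketch of what such a pair could look like, using only Python's standard library
codecs rather than Parquet's own (an assumption made purely for illustration). The codec
table, the one-byte name-length header, and the function names are all hypothetical; the
point is only that the blob records its own codec, so uncompress() needs no extra arguments:

```python
import bz2
import lzma
import zlib

# Hypothetical codec table: name -> (compress(data, level), decompress(data)).
_CODECS = {
    "zlib": (lambda data, level: zlib.compress(data, level), zlib.decompress),
    "bz2": (lambda data, level: bz2.compress(data, level), bz2.decompress),
    "lzma": (lambda data, level: lzma.compress(data, preset=level), lzma.decompress),
}


def compress(blob: bytes, compression: str = "zlib", compression_level: int = 6) -> bytes:
    """Compress blob and prefix it with the codec name so it is self-describing."""
    if compression not in _CODECS:
        raise ValueError(f"unknown compression: {compression!r}")
    name = compression.encode("ascii")
    compress_fn, _ = _CODECS[compression]
    # Layout (hypothetical): [1 byte: name length][codec name][compressed payload]
    return bytes([len(name)]) + name + compress_fn(blob, compression_level)


def uncompress(compressed_blob: bytes) -> bytes:
    """Read the codec name from the header and decompress the payload."""
    name_len = compressed_blob[0]
    name = compressed_blob[1:1 + name_len].decode("ascii")
    if name not in _CODECS:
        raise ValueError(f"unknown compression: {name!r}")
    _, decompress_fn = _CODECS[name]
    return decompress_fn(compressed_blob[1 + name_len:])


# Made-up JSON-like metadata, repeated to make it easily compressible.
metadata = b'{"sensor": "example", "units": "V", "notes": "data log"}' * 2000
packed = compress(metadata, "zlib", 6)
print(len(metadata), len(packed), uncompress(packed) == metadata)
```

The resulting bytes could then be stored as a binary metadata value, subject to the base64
overhead discussed above.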

On 2020/11/04 16:41:00, Wes McKinney <wesmckinn@gmail.com> wrote: 
> You mean the key-value metadata at the schema/field-level? That can
> be binary (it gets base64-encoded when written to Parquet)
> 
> On Wed, Nov 4, 2020 at 10:22 AM Jason Sachs <jmsachs@gmail.com> wrote:
> >
> > OK. If I take the manual approach, do parquet / arrow care whether metadata is binary
> > or not?
> >
> > On 2020/11/04 14:16:37, Wes McKinney <wesmckinn@gmail.com> wrote:
> > > There is not to my knowledge.
> > >
> > > On Tue, Nov 3, 2020 at 5:55 PM Jason Sachs <jmsachs@gmail.com> wrote:
> > > >
> > > > Is there any built-in method to compress parquet metadata? From what I
> > > > can tell, the main table columns are compressed, but not the metadata.
> > > >
> > > > I have metadata which includes 100-200KB of text (JSON format) that is
> > > > easily compressible... is there any alternative to doing it myself?
> > >
> 
