arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: File size PyArrow/Parquet
Date Tue, 25 Feb 2020 22:52:12 GMT
It depends a lot on the file. Parquet's encoding strategy is quite
different from gzipping a CSV: there are cases where a Parquet file
will be 10x smaller than a csv.gz, and other cases where the Parquet
file will be larger. The Parquet file's metadata tells you how large
each compressed column chunk is, which can help identify which
columns are compressing poorly. For example:

In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('/home/wesm/code/arrow/cpp/submodules/parquet-testing/data/alltypes_plain.parquet')

In [3]: for i in range(pf.metadata.num_row_groups):
   ...:     for j in range(pf.metadata.num_columns):
   ...:         col = pf.metadata.row_group(i).column(j)
   ...:         print("row group {} column {} compressed size {}".format(i, j, col.total_compressed_size))
   ...:
row group 0 column 0 compressed size 73
row group 0 column 1 compressed size 24
row group 0 column 2 compressed size 47
row group 0 column 3 compressed size 47
row group 0 column 4 compressed size 47
row group 0 column 5 compressed size 55
row group 0 column 6 compressed size 47
row group 0 column 7 compressed size 55
row group 0 column 8 compressed size 88
row group 0 column 9 compressed size 49
row group 0 column 10 compressed size 13
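
If a handful of column chunks account for most of the file, it can
also help to check which codec and encodings each chunk used, and how
well it compressed relative to its uncompressed size. A rough sketch
continuing the same session (the printed ratio is just illustrative,
and assumes non-empty column chunks):

In [4]: for i in range(pf.metadata.num_row_groups):
   ...:     for j in range(pf.metadata.num_columns):
   ...:         col = pf.metadata.row_group(i).column(j)
   ...:         # codec, encodings, and both sizes are all available
   ...:         # on the column chunk metadata
   ...:         print("row group {} column {} codec {} encodings {} ratio {:.2f}".format(
   ...:             i, j, col.compression, col.encodings,
   ...:             col.total_compressed_size / col.total_uncompressed_size))
   ...:

If the codec looks like a poor fit for the data, rewriting the file
with a few alternatives and comparing the resulting sizes is a quick
experiment (the input path here is hypothetical):

import os
import pyarrow.parquet as pq

table = pq.read_table('data.parquet')  # hypothetical input file
for codec in ['snappy', 'gzip', 'zstd']:
    out = 'data_{}.parquet'.format(codec)
    pq.write_table(table, out, compression=codec)
    print(codec, os.path.getsize(out))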

On Tue, Feb 25, 2020 at 4:33 PM Samrat Batth <samratbatth@gmail.com> wrote:
>
> I am a new pyarrow/parquet user.
>
> I ran the following test:
> - 18mb zipped csv file (approx 1.5 mil rows) which has data for one month
> - saved it as a parquet file partitioned on date with default compression; the resulting
> file size is ~45mb. If I don’t partition on date, the file size is ~30mb.
>
> My expectation was that the parquet file size would be less than the zipped csv file - any
> comments?
> Thx
>
