arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject C++ and Python size problems with Arrow 0.13.0
Date Wed, 03 Apr 2019 00:23:43 GMT
hi folks,

I noticed that the arrow-cpp conda packages for Windows have ballooned in size
to nearly 140 megabytes for RC4:

https://bintray.com/apache/arrow/python-rc/0.13.0-rc4#files/python-rc/0.13.0-rc4

Looking at one of these packages, it seems the Windows static libraries
are huge -- I'm not sure why they are so big, but we should probably
investigate.

$ ll Library/lib/
total 741796
-rw-r--r-- 1 wesm wesm   1507048 Mar 27 23:34 arrow.lib
-rw-r--r-- 1 wesm wesm     76184 Mar 27 23:35 arrow_python.lib
-rw-r--r-- 1 wesm wesm  61322082 Mar 27 23:36 arrow_python_static.lib
-rw-r--r-- 1 wesm wesm 328090044 Mar 27 23:37 arrow_static.lib
drwxr-xr-x 3 wesm wesm      4096 Apr  2 19:12 cmake/
-rw-r--r-- 1 wesm wesm    302496 Mar 27 23:38 gandiva.lib
-rw-r--r-- 1 wesm wesm 239314018 Mar 27 23:40 gandiva_static.lib
-rw-r--r-- 1 wesm wesm    491292 Mar 27 23:41 parquet.lib
-rw-r--r-- 1 wesm wesm 128473780 Mar 27 23:42 parquet_static.lib
drwxr-xr-x 2 wesm wesm      4096 Apr  2 19:12 pkgconfig/

As a mitigating measure in the meantime, I would suggest that we stop
bundling the static libraries in the arrow-cpp conda package, since
we're just hurting release managers and users with a large package
download when they `conda install pyarrow`. Can someone open a JIRA
issue about this? If packaging the static libraries in conda is
something that people need then we could create a separate
arrow-cpp-static artifact.
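
As a rough check on how much of that directory the static libraries account for, here is a minimal Python sketch. The sizes are copied from the `ll Library/lib/` listing above; the function name `split_static` and the choice to classify files by the `_static.lib` suffix are assumptions made for illustration. These are uncompressed on-disk sizes, so they will not match the size of the compressed conda package.

```python
# Sizes in bytes, copied from the `ll Library/lib/` listing above
# (the cmake/ and pkgconfig/ directories are left out).
lib_sizes = {
    "arrow.lib": 1507048,
    "arrow_python.lib": 76184,
    "arrow_python_static.lib": 61322082,
    "arrow_static.lib": 328090044,
    "gandiva.lib": 302496,
    "gandiva_static.lib": 239314018,
    "parquet.lib": 491292,
    "parquet_static.lib": 128473780,
}


def split_static(sizes):
    """Return (static_bytes, other_bytes), keyed on the `_static.lib` suffix.

    Hypothetical helper: the suffix test is an assumption based on the
    file names in the listing, not on any packaging metadata.
    """
    static = sum(n for name, n in sizes.items() if name.endswith("_static.lib"))
    other = sum(sizes.values()) - static
    return static, other


static_bytes, other_bytes = split_static(lib_sizes)
total_bytes = static_bytes + other_bytes

print(f"static libraries: {static_bytes / 1e6:.1f} MB")
print(f"other .lib files: {other_bytes / 1e6:.1f} MB")
print(f"static share:     {static_bytes / total_bytes:.1%}")
```

By this count, nearly all of the bytes under `Library/lib/` come from the four `*_static.lib` files, which is consistent with dropping them from the package as a first mitigation.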

The production packages in conda-forge are a bit smaller (less than
100 MB), but still quite large.

https://anaconda.org/conda-forge/arrow-cpp/files

I noticed also that the wheel Python packages on Linux have become
quite large. The Python 3.7 wheel is 48.5 megabytes for example. The
expected culprit is libgandiva.so, where I see

-rwxr-xr-x 1 wesm wesm   131047 Apr  2 19:18 libarrow_boost_filesystem.so*
-rwxr-xr-x 1 wesm wesm   131047 Apr  2 19:18 libarrow_boost_filesystem.so.1.66.0*
-rwxr-xr-x 1 wesm wesm  1253641 Apr  2 19:18 libarrow_boost_regex.so*
-rwxr-xr-x 1 wesm wesm  1253641 Apr  2 19:18 libarrow_boost_regex.so.1.66.0*
-rwxr-xr-x 1 wesm wesm    30081 Apr  2 19:18 libarrow_boost_system.so*
-rwxr-xr-x 1 wesm wesm    30081 Apr  2 19:18 libarrow_boost_system.so.1.66.0*
-rwxr-xr-x 1 wesm wesm  1613712 Apr  2 19:18 libarrow_python.so*
-rwxr-xr-x 1 wesm wesm  1400561 Apr  2 19:18 libarrow_python.so.13*
-rwxr-xr-x 1 wesm wesm 12543416 Apr  2 19:18 libarrow.so*
-rwxr-xr-x 1 wesm wesm 11540172 Apr  2 19:18 libarrow.so.13*
-rw-r--r-- 1 wesm wesm  6393593 Apr  2 19:18 lib.cpp
-rwxr-xr-x 1 wesm wesm  2558504 Apr  2 19:18 lib.cpython-37m-x86_64-linux-gnu.so*
-rwxr-xr-x 1 wesm wesm 61260912 Apr  2 19:18 libgandiva.so*
-rwxr-xr-x 1 wesm wesm 57342916 Apr  2 19:18 libgandiva.so.13*
-rwxr-xr-x 1 wesm wesm  3567224 Apr  2 19:18 libparquet.so*
-rwxr-xr-x 1 wesm wesm  3035367 Apr  2 19:18 libparquet.so.13*
-rwxr-xr-x 1 wesm wesm   352440 Apr  2 19:18 libplasma.so*
-rwxr-xr-x 1 wesm wesm   315802 Apr  2 19:18 libplasma.so.13*
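
To put a number on the libgandiva suspicion, the following Python sketch groups the files in the listing above by library, folding `libfoo.so` and `libfoo.so.N` together, and ranks the groups by size. The sizes are copied from the listing; the helper name `group_by_library` and the regular expression used to strip the `.so` suffix and version are assumptions for illustration. Only the files shown in the listing are counted, and the figures are uncompressed, so they are not directly comparable to the 48.5 megabyte wheel.

```python
import re
from collections import defaultdict

# Sizes in bytes, copied from the listing above.
listing = {
    "libarrow_boost_filesystem.so": 131047,
    "libarrow_boost_filesystem.so.1.66.0": 131047,
    "libarrow_boost_regex.so": 1253641,
    "libarrow_boost_regex.so.1.66.0": 1253641,
    "libarrow_boost_system.so": 30081,
    "libarrow_boost_system.so.1.66.0": 30081,
    "libarrow_python.so": 1613712,
    "libarrow_python.so.13": 1400561,
    "libarrow.so": 12543416,
    "libarrow.so.13": 11540172,
    "lib.cpp": 6393593,
    "lib.cpython-37m-x86_64-linux-gnu.so": 2558504,
    "libgandiva.so": 61260912,
    "libgandiva.so.13": 57342916,
    "libparquet.so": 3567224,
    "libparquet.so.13": 3035367,
    "libplasma.so": 352440,
    "libplasma.so.13": 315802,
}


def group_by_library(sizes):
    """Sum sizes per library, folding libfoo.so and libfoo.so.N together."""
    totals = defaultdict(int)
    for name, size in sizes.items():
        # Strip a trailing ".so" or ".so.<version>"; other names pass through.
        stem = re.sub(r"\.so(\..*)?$", "", name)
        totals[stem] += size
    return dict(totals)


totals = group_by_library(listing)
grand_total = sum(totals.values())

# Largest groups first.
for stem, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stem:35s} {size / 1e6:8.1f} MB {size / grand_total:6.1%}")
```

Grouped this way, the two libgandiva files together are by far the largest entry, ahead of the two libarrow files.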

There's something very odd here, though, which is that libgandiva.so
and libgandiva.so.13 appear to be distinct. They have different
checksums, for example

(pyarrow-0.13.0-py37-test) 19:19 ~/Downloads/arrow-cpp-py36-vc14 $ sha256sum ~/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so
8f1026d7bf476b90a0cac8239947ad334ee91cd31a944102aff6e8a67ae973e8  /home/wesm/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so
(pyarrow-0.13.0-py37-test) 19:21 ~/Downloads/arrow-cpp-py36-vc14 $ sha256sum ~/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so.13
9969a50787f8e0411115c0bfffccd3a350fde5f8c2f319acd72f3cf8097365dc  /home/wesm/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so.13

That seems buggy to me. We might also investigate if there's a way to
trim the binary sizes in some way.
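
The same comparison can be run across every library in the directory rather than one pair at a time. Below is a minimal Python sketch, assuming the libraries follow the `libfoo.so` / `libfoo.so.N` naming seen in the listing; the helper names `sha256_of` and `check_so_pairs` and the three status strings are hypothetical, chosen for illustration. It reports whether each pair is the same file on disk (for example via a symlink), a byte-identical copy, or two files with different contents.

```python
import hashlib
import os
import re
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_so_pairs(lib_dir):
    """Compare each libfoo.so.N in lib_dir with its unversioned libfoo.so.

    Returns a list of (unversioned_name, versioned_name, status) tuples,
    where status is one of "same file", "identical copy" or
    "distinct contents".
    """
    lib_dir = Path(lib_dir)
    results = []
    for versioned in sorted(lib_dir.glob("*.so.*")):
        # libfoo.so.13 -> libfoo.so, libfoo.so.1.66.0 -> libfoo.so
        base = lib_dir / re.sub(r"\.so\..*$", ".so", versioned.name)
        if not base.exists():
            continue
        if os.path.samefile(base, versioned):
            status = "same file"  # symlink or hard link
        elif sha256_of(base) == sha256_of(versioned):
            status = "identical copy"
        else:
            status = "distinct contents"
        results.append((base.name, versioned.name, status))
    return results


# Example use against an installed pyarrow (path is illustrative):
#   for row in check_so_pairs(".../site-packages/pyarrow"):
#       print(*row)
```

Any pair reported as "distinct contents" is in the same situation as libgandiva.so and libgandiva.so.13 above. The differing sizes in the listing suggest that libarrow, libarrow_python, libparquet and libplasma would be flagged as well.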

Thanks
Wes
