arrow-dev mailing list archives

From "Haowei Yu (Jira)" <>
Subject [jira] [Created] (ARROW-6776) [Python] Need a lite version of pyarrow
Date Thu, 03 Oct 2019 01:09:00 GMT
Haowei Yu created ARROW-6776:

             Summary: [Python] Need a lite version of pyarrow
                 Key: ARROW-6776
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.14.1
            Reporter: Haowei Yu

I am building a library package on top of pyarrow, so I declare pyarrow as a dependency
and ship the package to our customers. When a customer installs our package, pyarrow and
pyarrow's own dependency (numpy) are installed as well. However, the combined dependency size is huge.
(py36env) [hyu@c6x64-hyu-newuser-final-clone connector]$ ls -l --block-size=M /home/hyu/py36env/lib/python3.6/site-packages/pyarrow/

total 186M
numpy is around 80 MB, so the total is more than 250 MB.
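
The per-package sizes above can also be measured from Python, which is handy when checking
a bundle before upload. This is a minimal sketch using only the standard library; the
function names are hypothetical, and the result is the sum of file sizes, which can differ
slightly from the block-based "total" that ls prints.

```python
import os
import sys


def dir_size_bytes(root):
    """Return the summed size of regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a linked file is not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def to_mb(n_bytes):
    """Convert a byte count to mebibytes, the unit used by ls --block-size=M."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Pass one or more package directories, e.g. the pyarrow and numpy
    # directories inside site-packages.
    for package_dir in sys.argv[1:]:
        print(f"{package_dir}: {to_mb(dir_size_bytes(package_dir)):.1f} MB")
```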

Our customers want to bundle all dependencies and run the code inside AWS Lambda, but they
hit the size limit and the code failed to run.

Looking into the pyarrow package, I saw that multiple .so files are shipped both with and
without a version suffix. I wonder whether you could remove one of each pair (either the one
with the suffix or the one without); that would, at the very least, cut the package size in half.
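
One way to confirm which .so files are shipped twice is to group them by content hash.
The sketch below assumes the suffixed and unsuffixed files are byte-identical copies, which
I have not verified for every library; the function names are hypothetical and only the
standard library is used.

```python
import hashlib
import os
import re
from collections import defaultdict

# Matches names such as libfoo.so, libfoo.so.14 or libfoo.so.14.1.0
SHARED_LIB_RE = re.compile(r"\.so(\.\d+)*$")


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_libs(root):
    """Return sorted groups of shared-library paths with identical content."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Symlinks cost no extra space, so only real files are compared.
            if SHARED_LIB_RE.search(name) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    groups = [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
    return sorted(groups)
```

Each returned group is a set of files where all but one copy is redundant on disk. Whether
a copy can actually be deleted depends on which name the other libraries link against.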

Further, our library only needs IPC to read data as record batches; I don't need Arrow
Flight at all (it is the biggest .so file, at around 100 MB). I wonder whether you could
publish a lite version of pyarrow so that I can specify it as the dependency instead. Alternatively,
I could build my own lite version and publish it to PyPI. However, that approach causes a further
problem if a customer is also using the "fat" version of pyarrow, unless the namespace of the
lite version is changed.

Another alternative is to bundle pyarrow with our library (copying the whole directory
into a vendored namespace) and ship it to our customers without declaring pyarrow as a dependency.
The advantage is that I can build pyarrow with whatever options, sub-modules, and libraries
I need. However, after many attempts I could not get this to work: pyarrow uses absolute imports,
which fail to resolve once the package is moved to the new location.
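
To illustrate the import failure, here is a small self-contained sketch. It does not use
pyarrow itself: "fakearrow" and "myproduct" are hypothetical package names created in a
temporary directory. A package that imports its own submodule by absolute name breaks when
copied under a vendored namespace, while the same package written with a relative import
still works. pyarrow's compiled extension modules may add further complications that this
sketch does not cover.

```python
import importlib
import os
import sys
import tempfile


def write(path, text):
    """Create parent directories as needed and write text to path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)


def try_vendored_import(init_source):
    """Build myproduct/vendored/fakearrow in a temp dir and try to import it.

    Returns fakearrow.VALUE on success, or the exception class name on failure.
    """
    with tempfile.TemporaryDirectory() as root:
        write(os.path.join(root, "myproduct", "__init__.py"), "")
        write(os.path.join(root, "myproduct", "vendored", "__init__.py"), "")
        pkg = os.path.join(root, "myproduct", "vendored", "fakearrow")
        write(os.path.join(pkg, "__init__.py"), init_source)
        write(os.path.join(pkg, "lib.py"), "VALUE = 42\n")
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("myproduct.vendored.fakearrow")
            return module.VALUE
        except ImportError as exc:
            return type(exc).__name__
        finally:
            # Undo the path change and forget the temporary packages.
            sys.path.remove(root)
            for name in list(sys.modules):
                if name == "myproduct" or name.startswith("myproduct."):
                    del sys.modules[name]


# Absolute import: "fakearrow" is no longer a top-level package, so this fails.
absolute_result = try_vendored_import("from fakearrow.lib import VALUE\n")
# Relative import: resolved against the package's new location, so this works.
relative_result = try_vendored_import("from .lib import VALUE\n")
print(absolute_result, relative_result)
```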

Any insight into how I should resolve this issue?




