arrow-dev mailing list archives

From "Eric Kisslinger (Jira)" <>
Subject [jira] [Created] (ARROW-7059) Reading parquet file with many columns is still slow for 0.15.1
Date Mon, 04 Nov 2019 22:08:00 GMT
Eric Kisslinger created ARROW-7059:

             Summary: Reading parquet file with many columns is still slow for 0.15.1
                 Key: ARROW-7059
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
         Environment: Linux OS with RHEL 7.7 distribution

blkcqas037:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

            Reporter: Eric Kisslinger

Reading Parquet files with a large number of columns still seems to be very slow in 0.15.1 compared
to 0.14.1. I am using the same test used in
except I set {{use_threads=False}} to make for an apples-to-apples comparison with respect
to the number of CPUs.

{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}
{{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(10000)})}}
{{pq.write_table(table, "test_wide.parquet")}}
{{res = pq.read_table("test_wide.parquet")}}
{{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}
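
The snippet above depends on the IPython {{%time}} magic. As a rough sketch, the same measurement could be run as a plain Python script along the following lines. The helper names {{time_call}} and {{run_benchmark}} are hypothetical and not part of pyarrow; only {{pa.table}}, {{pq.write_table}} and {{pq.read_table}} come from the report.

```python
import time


def time_call(fn, repeats=3):
    """Return (best_wall_seconds, best_cpu_seconds) over `repeats` calls of fn().

    Hypothetical helper standing in for IPython's %time magic.
    """
    best_wall = best_cpu = float("inf")
    for _ in range(repeats):
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        fn()
        best_wall = min(best_wall, time.perf_counter() - wall_start)
        best_cpu = min(best_cpu, time.process_time() - cpu_start)
    return best_wall, best_cpu


def run_benchmark(path="test_wide.parquet", n_cols=10000, n_rows=10):
    """Write a wide table and time a single-threaded read (hypothetical wrapper)."""
    # Imported here so that time_call stays usable without pyarrow installed.
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Same table shape as in the report: 10000 columns of 10 random values.
    table = pa.table({"c" + str(i): np.random.randn(n_rows) for i in range(n_cols)})
    pq.write_table(table, path)
    pq.read_table(path)  # warm-up read, as in the report
    return time_call(lambda: pq.read_table(path, use_threads=False))
```

Calling {{run_benchmark()}} under each pyarrow version returns the best wall and CPU times in seconds, which can then be compared directly. Taking the best of several runs is a choice made for this sketch; the report timed a single read.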

*In 0.14.1 with use_threads=False:*

{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}
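
For scale, the two wall times above work out to a slowdown of roughly 19x. A minimal calculation, using only the figures reported here (the variable names are illustrative):

```python
# Wall times copied from the report, in seconds.
wall_0_14_1 = 0.525
wall_0_15_1 = 9.93

# Ratio of the 0.15.1 read time to the 0.14.1 read time.
slowdown = wall_0_15_1 / wall_0_14_1
print(f"0.15.1 is about {slowdown:.1f}x slower than 0.14.1")
```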

This message was sent by Atlassian Jira