arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: initialization on first call to pyarrow.array()?
Date Fri, 28 Aug 2020 14:32:34 GMT
The Arrow C++ libraries do some other one-time static initialization,
so we should find out whether the slowdown is all due to importing
pandas or to something else.
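
One way to separate the two costs, as a rough sketch rather than a
definitive benchmark: import pandas explicitly and time that on its own,
then time repeated pyarrow.array() calls. Whatever extra time is left on
the first call would then point at something other than the pandas
import. The helper names time_calls and time_import are made up for this
example and are not part of pyarrow.

```python
import importlib
import time


def time_calls(fn, repeats=4):
    """Return wall-clock durations (seconds) of consecutive calls to fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def time_import(module_name):
    """Return seconds spent importing module_name, or None if unavailable."""
    start = time.perf_counter()
    try:
        importlib.import_module(module_name)
    except ImportError:
        return None
    return time.perf_counter() - start


def main():
    try:
        import numpy as np
        import pyarrow
    except ImportError:
        print("numpy and pyarrow are required for this measurement")
        return

    # Pay the pandas import cost up front so it is not charged to the
    # first pyarrow.array() call (None means pandas is not installed).
    pandas_seconds = time_import("pandas")

    arr = np.random.rand(1)
    durations = time_calls(lambda: pyarrow.array(arr))

    print("pandas import:", pandas_seconds)
    for i, seconds in enumerate(durations, start=1):
        print(f"pyarrow.array call {i}: {seconds:.6f} s")


if __name__ == "__main__":
    main()
```

If the first pyarrow.array() call is still noticeably slower than the
later ones after pandas has been imported, the remainder is presumably
coming from other one-time initialization.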

On Fri, Aug 28, 2020 at 1:48 AM Joris Van den Bossche
<jorisvandenbossche@gmail.com> wrote:
>
> Hi Max,
>
> I assume (part of) the slowdown comes from trying to import pandas.
> If I add an "import pandas" to your script, the difference with the
> first run is much smaller (although still a difference).
>
> Inside the array function, we are lazily importing pandas to check if
> the input is a pandas object. I suppose that in theory, if the input is
> a numpy array, we should also be able to avoid this pandas import
> (maybe switching the order of some checks).
>
> Best,
> Joris
>
> On Fri, 28 Aug 2020 at 01:10, Max Grossman <jmaxg3@gmail.com> wrote:
>>
>> Hi all,
>>
>> Say I've got a simple program like the following that converts a
>> numpy array to a pyarrow array several times in a row, and times each
>> of those conversions:
>>
>> import pyarrow
>> import numpy as np
>> import time
>>
>> arr = np.random.rand(1)
>>
>> t1 = time.time()
>> pyarrow.array(arr)
>> t2 = time.time()
>> pyarrow.array(arr)
>> t3 = time.time()
>> pyarrow.array(arr)
>> t4 = time.time()
>> pyarrow.array(arr)
>> t5 = time.time()
>>
>> I'm noticing that the first call to pyarrow.array() is taking
>> ~0.3-0.5 s while the rest are nearly instantaneous (1e-05s).
>>
>> Does anyone know what might be causing this? My assumption is some
>> one-time initialization of pyarrow on the first call to the library,
>> in which case I'd like to see if there's some way to explicitly
>> trigger that initialization earlier in the program. But also curious
>> to hear if there is a different explanation.
>>
>> Right now I'm working around this by just calling pyarrow.array([])
>> at application start up -- I realize this doesn't actually eliminate
>> the added time, but it does move it out of the critical section for
>> any benchmarking runs.
>>
>> Thanks,
>>
>> Max
