Hi Max,

I assume (part of) the slowdown comes from pyarrow trying to import pandas. If I add an "import pandas" to your script, the gap between the first call and the later ones is much smaller (although there is still a gap).
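
If pandas may or may not be installed, the same up-front import can be made optional. This is a minimal sketch, not anything pyarrow provides: warm_up is a hypothetical helper name, and the default module list is an assumption. It only moves import cost to startup; any remaining first-call overhead would still be paid later.

```python
import importlib
import time


def warm_up(module_names=("pandas", "pyarrow")):
    """Import each named module now; return {name: seconds} for those that loaded.

    Hypothetical helper: modules that are not installed are skipped silently.
    """
    timings = {}
    for name in module_names:
        start = time.perf_counter()
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # optional dependency is missing, nothing to warm up
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Call once at application startup, before any timed section.
    for name, seconds in warm_up().items():
        print(f"imported {name} in {seconds:.3f} s")
```

Calling warm_up() once at startup keeps the import time out of any benchmarked region, and the returned dict shows how much time each import took.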

Inside pyarrow.array(), we lazily import pandas to check whether the input is a pandas object. I suppose that in theory, if the input is a numpy array, we should also be able to avoid this pandas import (maybe by switching the order of some checks).
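
To illustrate why a lazy import makes only the first call slow, here is a rough sketch. It is not pyarrow's actual code: convert is a hypothetical function, and the stdlib decimal module stands in for pandas (an assumption, so the example runs without pandas). The first import executes the module; later imports are a lookup in sys.modules.

```python
import importlib
import sys
import time

HEAVY_MODULE = "decimal"  # stand-in for pandas (assumption)


def convert(values):
    """Hypothetical converter that lazily imports a module on every call."""
    # Only the first import actually loads the module; afterwards this is
    # a cheap lookup in the sys.modules cache.
    importlib.import_module(HEAVY_MODULE)
    return list(values)


def time_calls(func, arg, repeats=4):
    """Call func(arg) `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    print(f"{HEAVY_MODULE} already imported: {HEAVY_MODULE in sys.modules}")
    for i, seconds in enumerate(time_calls(convert, [0.5]), start=1):
        print(f"call {i}: {seconds:.6f} s")
```

If the stand-in module has not been imported yet, the first call would be expected to take noticeably longer than the rest, which is the same pattern as in the timings reported below.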

Best,
Joris

On Fri, 28 Aug 2020 at 01:10, Max Grossman <jmaxg3@gmail.com> wrote:

Hi all,

Say I've got a simple program like the following that converts a numpy array to a pyarrow array several times in a row, and times each of those conversions:

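import time

import numpy as np
import pyarrow

# A single-element array, so the conversion itself is negligible
arr = np.random.rand(1)

# Convert the same array four times, taking a timestamp around each call
t1 = time.time()
pyarrow.array(arr)
t2 = time.time()
pyarrow.array(arr)
t3 = time.time()
pyarrow.array(arr)
t4 = time.time()
pyarrow.array(arr)
t5 = time.time()

# Elapsed seconds for each of the four calls
print(t2 - t1, t3 - t2, t4 - t3, t5 - t4)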

I'm noticing that the first call to pyarrow.array() takes ~0.3-0.5 s, while the rest are nearly instantaneous (~1e-05 s).

Does anyone know what might be causing this? My assumption is that pyarrow does some one-time initialization on the first call into the library, in which case I'd like to see if there's some way to explicitly trigger that initialization earlier in the program. I'm also curious to hear if there is a different explanation.

Right now I'm working around this by just calling pyarrow.array([]) at application start up -- I realize this doesn't actually eliminate the added time, but it does move it out of the critical section for any benchmarking runs.

Thanks,

Max