arrow-user mailing list archives

From Joris Van den Bossche <jorisvandenboss...@gmail.com>
Subject Re: initialization on first call to pyarrow.array()?
Date Fri, 28 Aug 2020 06:48:12 GMT
Hi Max,

I assume (part of) the slowdown comes from trying to import pandas. If I
add an "import pandas" at the top of your script, the first call is much
closer to the later ones (although some difference remains).

Inside the array function, we lazily import pandas to check whether the
input is a pandas object. I suppose that in theory, if the input is a numpy
array, we should be able to avoid this pandas import as well (perhaps by
reordering some of the checks).
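
As a rough sketch of how one could check this, the snippet below times the
optional imports up front and then times a few conversions separately. The
helper names (timed, warm_up, demo) are made up for illustration and are not
part of pyarrow; it also assumes pandas is the only lazily imported module
that matters here, which may not be the whole story.

```python
import importlib
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def warm_up(module_names=("pandas",)):
    """Eagerly import optional modules.

    Returns {name: seconds taken, or None if the module is not installed}.
    The default list is an assumption based on the pandas check described above.
    """
    timings = {}
    for name in module_names:
        try:
            _, elapsed = timed(importlib.import_module, name)
        except ImportError:
            timings[name] = None
        else:
            timings[name] = elapsed
    return timings


def demo():
    """Time a few pyarrow.array() calls; returns None if numpy/pyarrow are missing."""
    try:
        import numpy as np
        import pyarrow as pa
    except ImportError:
        return None
    arr = np.random.rand(1)
    return [timed(pa.array, arr)[1] for _ in range(3)]


if __name__ == "__main__":
    # Pay the import cost first, outside of any timed section.
    print("import timings:", warm_up())
    print("pyarrow.array timings:", demo())
```

If the import timing accounts for most of the first-call cost, calling
warm_up() at application start should move that cost out of the benchmarked
section, much like the pyarrow.array([]) workaround.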

Best,
Joris

On Fri, 28 Aug 2020 at 01:10, Max Grossman <jmaxg3@gmail.com> wrote:

> Hi all,
>
> Say I've got a simple program like the following that converts a numpy
> array to a pyarrow array several times in a row, and times each of those
> conversions:
>
> import pyarrow
> import numpy as np
> import time
>
> arr = np.random.rand(1)
>
> t1 = time.time()
> pyarrow.array(arr)
> t2 = time.time()
> pyarrow.array(arr)
> t3 = time.time()
> pyarrow.array(arr)
> t4 = time.time()
> pyarrow.array(arr)
> t5 = time.time()
>
> I'm noticing that the first call to pyarrow.array() takes ~0.3-0.5 s,
> while the rest are nearly instantaneous (about 1e-05 s).
>
> Does anyone know what might be causing this? My assumption is some
> one-time initialization of pyarrow on the first call to the library, in
> which case I'd like to see if there's some way to explicitly trigger that
> initialization earlier in the program. But also curious to hear if there is
> a different explanation.
>
> Right now I'm working around this by just calling pyarrow.array([]) at
> application startup -- I realize this doesn't actually eliminate the added
> time, but it does move it out of the critical section for any benchmarking
> runs.
>
> Thanks,
>
> Max
>
