arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joris Van den Bossche <jorisvandenboss...@gmail.com>
Subject Re: Parquet file reading performance
Date Tue, 01 Oct 2019 19:33:00 GMT
On Tue, 1 Oct 2019 at 21:03, Maarten Ballintijn <maartenb@xs4all.nl> wrote:

>
> I ran cProfile to understand better what is going on in Pandas. Using your
> code below I find that
> Pandas runs a loop over generic the datetime64 conversion in case the
> datetime64 is not in ’ns’.
> The conversion unpacks the time into a date-time struct and converts the
> date-time struct back
> into a Timestamp. This is clearly slow and unnecessary.
>
> Adding a dedicated us->ns conversion should be possible with much better
> performance.
> (either in Cython or C).
>
> I can add a request for this on Pandas-issues unless someone thinks this
> is not a good idea?
>
> Hi Maarten,

Also speaking as a pandas developer, I think it is certainly worth raising
this on the pandas issue tracker.


> Somewhat surprisingly, specifying the dtype on the pd.Series constructor
> triggers another code path
> which is a further factor ~5 slower.
>

Hmm, it seems in that case it is falling back to `pd.to_datetime`, which
has apparently no special support at all for non-ns datetime64 arrays, and
converts it first to an object array to then parse it again ..

Joris

>
>
> > On Oct 1, 2019, at 7:07 AM, Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
> >
> > Some answers to the other questions:
> >
> > On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn <maartenb@xs4all.nl>
> wrote:
> >
> >> ...
> >> This leaves me with the following questions:
> >>
> >> - Who should I talk to to get this resolved in Pandas?
> >>
> >> You can open an issue on their tracker:
> > https://github.com/pandas-dev/pandas/issues/
> >
> >
> >> - Do you think it would be possible to take the DateTime column out of
> >> Arrow into numpy
> >> and transform it the to make it more amenable to Pandas? and possibly
> even
> >> for the value columns?
> >>
> >
> > The reason that this non-ns column takes more time is due to the
> conversion
> > to nanoseconds.
> > So one way this could be to already cast your arrow table to ns before
> > converting to pandas (this is implemented in arrow, but currently does
> not
> > check of out of bounds values; I opened
> > https://issues.apache.org/jira/browse/ARROW-6704 for this).
> >
> > But, I am not sure it can be improved with a big factor on the pandas
> side.
> > Let's look at some timings using:
> >
> > arr_int = np.arange(100000)
> > arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq='S'))
> > arr_us = arr_ns.astype("datetime64[us]")
> >
> > conversion from those arrays to a pandas Series only takes more time for
> > the microsecond datetimes:
> >
> > In [6]: %timeit pd.Series(arr_int)
> > 47.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [7]: %timeit pd.Series(arr_ns)
> > 27.7 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [8]: %timeit pd.Series(arr_us)
> > 6.22 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >
> > But just converting microseconds to nanoseconds also already takes
> almost a
> > millisecond on this data (with numpy, but arrow is not faster):
> >
> > In [15]: %timeit arr_us.astype("datetime64[ns]")
> > 742 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >
> > So it should certainly be possible to get those 6ms down to something
> > around 1ms (and which can be a nice improvement! you're welcome to open
> an
> > issue on the pandas issue tracker for it), but it will still be factor of
> > 20 to 30 slower than the nanosecond case.
> >
> > Joris
> >
> >
> >>> On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <
> >> jorisvandenbossche@gmail.com> wrote:
> >>>
> >>> From looking a little bit further into this, it seems that it is mainly
> >>> pandas who is slower in creating a Series from an array of datetime64
> >>> compared from an array of ints.
> >>> And especially if it is not nanosecond resolution:
> >>>
> >>> In [29]: a_int = pa.array(np.arange(100000))
> >>>
> >>> In [30]: %timeit a_int.to_pandas()
> >>> 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops
> each)
> >>>
> >>> In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
> >>> freq='S'))
> >>>
> >>> In [32]: %timeit a_datetime.to_pandas()
> >>> 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops
> each)
> >>>
> >>> In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
> >>> freq='S'), pa.timestamp('us'))
> >>>
> >>> In [34]: %timeit a_datetime_us.to_pandas()
> >>> 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>>
> >>> Creating the datetime64 array inside pyarrow is also a bit slower
> >> compared
> >>> to int (causing the slower conversion of a_datetime), but the above
> >>> difference between between nanosecond and microsecond resolution is
> >> largely
> >>> due to pandas, not pyarrow (because pandas needs to convert the
> >>> microseconds to nanoseconds, and during that conversion will also check
> >>> that no datetimes were out of bounds for this resolution).
> >>>
> >>> And in parquet, the datetime data of the index column will be stored in
> >>> microsecond resolution (even if the original pandas data was nanosecond
> >>> resolution). And the slower reading of the parquet file with datetime
> >> index
> >>> is thus almost entirely due to the above difference in timing of
> >> converting
> >>> the int or datetime index column to pandas.
> >>> Parquet nowadays actually supports storing nanosecond resolution, and
> >> this
> >>> can be triggered in pyarrow by passing version="2.0" to pq.write_table
> >> (but
> >>> last what I heard this version is not yet considered production ready).
> >>>
> >>> Joris
> >>>
> >>> On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
> >>> jorisvandenbossche@gmail.com> wrote:
> >>>
> >>>> Hi Maarten,
> >>>>
> >>>> Thanks for the reproducible script. I ran it on my laptop on pyarrow
> >>>> master, and not seeing the difference between both datetime indexes:
> >>>>
> >>>> Versions:
> >>>> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
> >>>> 23:01:00)
> >>>> [GCC 7.3.0] on linux
> >>>> numpy:    1.16.4
> >>>> pandas:   0.26.0.dev0+447.gc168ecf26
> >>>> pyarrow:  0.14.1.dev642+g7f2d637db
> >>>>
> >>>> 1073741824 float64 8388608 16
> >>>> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
> >>>> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
> >>>> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
> >>>> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
> >>>> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
> >>>> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
> >>>> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
> >>>> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
> >>>> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
> >>>>
> >>>> The only change I needed to make in the script to get it running
> (within
> >>>> my memory limits) was the creation of the second DatetimeIndex
> >>>> (pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index
> of
> >>>> 1.5 billion elements, while only the last part of it is used. So
> changed
> >>>> that to index = pd.date_range('2018-01-01', '2019-09-01',
> >>>> freq='S').array[-rows:])
> >>>>
> >>>> The datetime index reading in general is still slower as the int
> index.
> >>>> But doing a bit more detailed timings, and it seems this is not due
to
> >> the
> >>>> reading of parquet, but the conversion of arrow to pandas (using the
> >> files
> >>>> from the benchmark):
> >>>>
> >>>> In [1]: import pyarrow.parquet as pq
> >>>>
> >>>> In [4]: %timeit pq.read_table('testdata.int.parquet')
> >>>> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> In [5]: %timeit pq.read_table('testdata.dt.parquet')
> >>>> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> In [6]: table_int = pq.read_table('testdata.int.parquet')
> >>>>
> >>>> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
> >>>>
> >>>> In [8]: %timeit table_int.to_pandas()
> >>>> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops
each)
> >>>>
> >>>> In [9]: %timeit table_datetime.to_pandas()
> >>>> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> So you can see that the parquet reading part is basically identical,
> but
> >>>> the conversion to pandas is much slower for the datetime-index case.
> >>>> I will try to look into that code path to see what makes this so much
> >>>> slower.
> >>>>
> >>>> Joris
> >>>>
> >>>>
> >>>> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <maartenb@xs4all.nl>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> The code to show the performance issue with DateTimeIndex is at:
> >>>>>
> >>>>>
> >> https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
> >>>>>
> >>>>> It shows three case 0) int index, 1) datetime index, 2) date time
> index
> >>>>> created in a slightly roundabout way
> >>>>>
> >>>>> I’m a little confused by the two date time cases. Case 2) is much
> >> slower
> >>>>> but the df compares identical to case 1)
> >>>>> (I originally used something like 2) to match our specific data.
I
> >> don’t
> >>>>> see why it behaves differently??)
> >>>>>
> >>>>> The timings I find are:
> >>>>>
> >>>>> 1073741824 float64 8388608 16
> >>>>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
> >>>>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
> >>>>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s
<<<
> >>>>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
> >>>>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
> >>>>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
> >>>>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
> >>>>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
> >>>>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
> >>>>>
> >>>>> Case 0, int index.  This is all great.
> >>>>> Case 1, date time index. We loose almost half the speed. Given that
a
> >>>>> datetime is only scaled from Pandas IIRC that seems like a lot?
> >>>>> Case  3, other datetime index. No idea what is going on.
> >>>>>
> >>>>> Any insights are much appreciated.
> >>>>>
> >>>>> Cheers,
> >>>>> Maarten.
> >>>>>
> >>>>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <wesmckinn@gmail.com>
> >> wrote:
> >>>>>>
> >>>>>> hi
> >>>>>>
> >>>>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <
> >> maartenb@xs4all.nl
> >>>>> <mailto:maartenb@xs4all.nl>> wrote:
> >>>>>>>
> >>>>>>> Hi Wes,
> >>>>>>>
> >>>>>>> Thanks for your quick response.
> >>>>>>>
> >>>>>>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge,
and:
> >>>>>>>
> >>>>>>> numpy:           1.16.5
> >>>>>>> pandas:          0.25.1
> >>>>>>> pyarrow:         0.14.1
> >>>>>>>
> >>>>>>> It looks like 0.15 is close, so I can wait for that.
> >>>>>>>
> >>>>>>> Theoretically I see three components driving the performance:
> >>>>>>> 1) The cost of locating the column (directory overhead)
> >>>>>>> 2) The overhead of reading a single column. (reading and
processing
> >>>>> meta data, setting up for reading)
> >>>>>>> 3) Bulk reading and unmarshalling/decoding the data.
> >>>>>>>
> >>>>>>> Only 1) would be impacted by the number of columns, but
if you’re
> >>>>> reading everything ideally this would not be a problem.
> >>>>>>
> >>>>>> The problem is more nuanced than that. Parquet's metadata is
> somewhat
> >>>>>> "heavy" at the column level. So when you're writing thousands
of
> >>>>>> columns, the fixed overhead associated with reading a single
column
> >>>>>> becomes problematic. There are several data structures associated
> with
> >>>>>> decoding a column have a fixed setup and teardown cost. Even
if
> there
> >>>>>> is 1 millisecond of fixed overhead related to reading a column
(I
> >>>>>> don't know what the number is exactly) then reading 10,000 columns
> has
> >>>>>> 10 seconds of unavoidable overhead. It might be useful for us
to
> >>>>>> quantify and communicate the expected overhead when metadata
and
> >>>>>> decoding is taken into account. Simply put having more than
1000
> >>>>>> columns is not advisable.
> >>>>>>
> >>>>>>> Based on an initial cursory look at the Parquet format I
guess the
> >>>>> index and the column meta-data might need to be read in full so
I can
> >> see
> >>>>> how that might slow down reading only a few columns out of a large
> >> set. But
> >>>>> that was not really the case here?
> >>>>>>>
> >>>>>>> What would you suggest for looking into the date index slow-down?
> >>>>>>
> >>>>>> Can you show a code example to make things easier for us to
see what
> >>>>>> you're seeing?
> >>>>>>
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Maarten.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmckinn@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>> hi Maarten,
> >>>>>>>>
> >>>>>>>> Are you using the master branch or 0.14.1? There are
a number of
> >>>>>>>> performance regressions in 0.14.0/0.14.1 that are addressed
in the
> >>>>>>>> master branch, to appear as 0.15.0 relatively soon.
> >>>>>>>>
> >>>>>>>> As a file format, Parquet (and columnar formats in general)
is not
> >>>>>>>> known to perform well with more than 1000 columns.
> >>>>>>>>
> >>>>>>>> On the other items, we'd be happy to work with you to
dig through
> >> the
> >>>>>>>> performance issues you're seeing.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Wes
> >>>>>>>>
> >>>>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
> >>>>> maartenb@xs4all.nl> wrote:
> >>>>>>>>>
> >>>>>>>>> Greetings,
> >>>>>>>>>
> >>>>>>>>> We have Pandas DataFrames with typically about 6,000
rows using
> >>>>> DateTimeIndex.
> >>>>>>>>> They have about 20,000 columns with integer column
labels, and
> data
> >>>>> with a dtype of float32.
> >>>>>>>>>
> >>>>>>>>> We’d like to store these dataframes with parquet,
using the
> ability
> >>>>> to read a subset of columns and to store meta-data with the file.
> >>>>>>>>>
> >>>>>>>>> We’ve found the reading performance less than
expected compared
> to
> >>>>> the published benchmarks (e.g. Wes’ blog post).
> >>>>>>>>>
> >>>>>>>>> Using a modified version of his script we did reproduce
his
> results
> >>>>> (~ 1GB/s for high entropy, no dict on MacBook pro)
> >>>>>>>>>
> >>>>>>>>> But there seem to be three factors that contribute
to the
> slowdown
> >>>>> for our datasets:
> >>>>>>>>>
> >>>>>>>>> - DateTimeIndex is much slower then a Int index
(we see about a
> >>>>> factor 5).
> >>>>>>>>> - The number of columns impact reading speed significantly
> (factor
> >>>>> ~2 going from 16 to 16,000 columns)
> >>>>>>>>> - The ‘use_pandas_metadata=True’ slows down
reading significantly
> >>>>> and appears unnecessary? (about 40%)
> >>>>>>>>>
> >>>>>>>>> Are there ways we could speedup the reading? Should
we use a
> >>>>> different layout?
> >>>>>>>>>
> >>>>>>>>> Thanks for your help and insights!
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Maarten
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ps. the routines we used:
> >>>>>>>>>
> >>>>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname:
str) -> None:
> >>>>>>>>> table = pa.Table.from_pandas(df)
> >>>>>>>>> pq.write_table(table, fname, use_dictionary=False,
> >>>>> compression=None)
> >>>>>>>>> return
> >>>>>>>>>
> >>>>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>>>>>> table = pq.read_table(fname, use_pandas_metadata=False,
> >>>>> use_threads=True)
> >>>>>>>>> df = table.to_pandas()
> >>>>>>>>> return df
> >>>>>
> >>>>>
> >>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message