arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joris Van den Bossche <jorisvandenboss...@gmail.com>
Subject Re: Parquet file reading performance
Date Tue, 01 Oct 2019 11:07:12 GMT
Some answers to the other questions:

On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn <maartenb@xs4all.nl> wrote:

> ...
> This leaves me with the following questions:
>
> - Who should I talk to to get this resolved in Pandas?
>
> You can open an issue on their tracker:
https://github.com/pandas-dev/pandas/issues/


> - Do you think it would be possible to take the DateTime column out of
> Arrow into numpy
> and transform it the to make it more amenable to Pandas? and possibly even
> for the value columns?
>

The reason that this non-ns column takes more time is due to the conversion
to nanoseconds.
So one way this could be to already cast your arrow table to ns before
converting to pandas (this is implemented in arrow, but currently does not
check of out of bounds values; I opened
https://issues.apache.org/jira/browse/ARROW-6704 for this).

But, I am not sure it can be improved with a big factor on the pandas side.
Let's look at some timings using:

arr_int = np.arange(100000)
arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq='S'))
arr_us = arr_ns.astype("datetime64[us]")

conversion from those arrays to a pandas Series only takes more time for
the microsecond datetimes:

In [6]: %timeit pd.Series(arr_int)
47.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: %timeit pd.Series(arr_ns)
27.7 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [8]: %timeit pd.Series(arr_us)
6.22 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But just converting microseconds to nanoseconds also already takes almost a
millisecond on this data (with numpy, but arrow is not faster):

In [15]: %timeit arr_us.astype("datetime64[ns]")
742 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So it should certainly be possible to get those 6ms down to something
around 1ms (and which can be a nice improvement! you're welcome to open an
issue on the pandas issue tracker for it), but it will still be factor of
20 to 30 slower than the nanosecond case.

Joris


> > On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
> >
> > From looking a little bit further into this, it seems that it is mainly
> > pandas who is slower in creating a Series from an array of datetime64
> > compared from an array of ints.
> > And especially if it is not nanosecond resolution:
> >
> > In [29]: a_int = pa.array(np.arange(100000))
> >
> > In [30]: %timeit a_int.to_pandas()
> > 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
> > freq='S'))
> >
> > In [32]: %timeit a_datetime.to_pandas()
> > 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >
> > In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
> > freq='S'), pa.timestamp('us'))
> >
> > In [34]: %timeit a_datetime_us.to_pandas()
> > 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >
> > Creating the datetime64 array inside pyarrow is also a bit slower
> compared
> > to int (causing the slower conversion of a_datetime), but the above
> > difference between between nanosecond and microsecond resolution is
> largely
> > due to pandas, not pyarrow (because pandas needs to convert the
> > microseconds to nanoseconds, and during that conversion will also check
> > that no datetimes were out of bounds for this resolution).
> >
> > And in parquet, the datetime data of the index column will be stored in
> > microsecond resolution (even if the original pandas data was nanosecond
> > resolution). And the slower reading of the parquet file with datetime
> index
> > is thus almost entirely due to the above difference in timing of
> converting
> > the int or datetime index column to pandas.
> > Parquet nowadays actually supports storing nanosecond resolution, and
> this
> > can be triggered in pyarrow by passing version="2.0" to pq.write_table
> (but
> > last what I heard this version is not yet considered production ready).
> >
> > Joris
> >
> > On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
> > jorisvandenbossche@gmail.com> wrote:
> >
> >> Hi Maarten,
> >>
> >> Thanks for the reproducible script. I ran it on my laptop on pyarrow
> >> master, and not seeing the difference between both datetime indexes:
> >>
> >> Versions:
> >> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
> >> 23:01:00)
> >> [GCC 7.3.0] on linux
> >> numpy:    1.16.4
> >> pandas:   0.26.0.dev0+447.gc168ecf26
> >> pyarrow:  0.14.1.dev642+g7f2d637db
> >>
> >> 1073741824 float64 8388608 16
> >> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
> >> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
> >> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
> >> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
> >> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
> >> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
> >> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
> >> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
> >> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
> >>
> >> The only change I needed to make in the script to get it running (within
> >> my memory limits) was the creation of the second DatetimeIndex
> >> (pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of
> >> 1.5 billion elements, while only the last part of it is used. So changed
> >> that to index = pd.date_range('2018-01-01', '2019-09-01',
> >> freq='S').array[-rows:])
> >>
> >> The datetime index reading in general is still slower as the int index.
> >> But doing a bit more detailed timings, and it seems this is not due to
> the
> >> reading of parquet, but the conversion of arrow to pandas (using the
> files
> >> from the benchmark):
> >>
> >> In [1]: import pyarrow.parquet as pq
> >>
> >> In [4]: %timeit pq.read_table('testdata.int.parquet')
> >> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> In [5]: %timeit pq.read_table('testdata.dt.parquet')
> >> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> In [6]: table_int = pq.read_table('testdata.int.parquet')
> >>
> >> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
> >>
> >> In [8]: %timeit table_int.to_pandas()
> >> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>
> >> In [9]: %timeit table_datetime.to_pandas()
> >> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> So you can see that the parquet reading part is basically identical, but
> >> the conversion to pandas is much slower for the datetime-index case.
> >> I will try to look into that code path to see what makes this so much
> >> slower.
> >>
> >> Joris
> >>
> >>
> >> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <maartenb@xs4all.nl>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> The code to show the performance issue with DateTimeIndex is at:
> >>>
> >>>
> https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
> >>>
> >>> It shows three case 0) int index, 1) datetime index, 2) date time index
> >>> created in a slightly roundabout way
> >>>
> >>> I’m a little confused by the two date time cases. Case 2) is much
> slower
> >>> but the df compares identical to case 1)
> >>> (I originally used something like 2) to match our specific data. I
> don’t
> >>> see why it behaves differently??)
> >>>
> >>> The timings I find are:
> >>>
> >>> 1073741824 float64 8388608 16
> >>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
> >>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
> >>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
> >>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
> >>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
> >>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
> >>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
> >>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
> >>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
> >>>
> >>> Case 0, int index.  This is all great.
> >>> Case 1, date time index. We loose almost half the speed. Given that a
> >>> datetime is only scaled from Pandas IIRC that seems like a lot?
> >>> Case  3, other datetime index. No idea what is going on.
> >>>
> >>> Any insights are much appreciated.
> >>>
> >>> Cheers,
> >>> Maarten.
> >>>
> >>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <wesmckinn@gmail.com>
> wrote:
> >>>>
> >>>> hi
> >>>>
> >>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <
> maartenb@xs4all.nl
> >>> <mailto:maartenb@xs4all.nl>> wrote:
> >>>>>
> >>>>> Hi Wes,
> >>>>>
> >>>>> Thanks for your quick response.
> >>>>>
> >>>>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge,
and:
> >>>>>
> >>>>> numpy:           1.16.5
> >>>>> pandas:          0.25.1
> >>>>> pyarrow:         0.14.1
> >>>>>
> >>>>> It looks like 0.15 is close, so I can wait for that.
> >>>>>
> >>>>> Theoretically I see three components driving the performance:
> >>>>> 1) The cost of locating the column (directory overhead)
> >>>>> 2) The overhead of reading a single column. (reading and processing
> >>> meta data, setting up for reading)
> >>>>> 3) Bulk reading and unmarshalling/decoding the data.
> >>>>>
> >>>>> Only 1) would be impacted by the number of columns, but if you’re
> >>> reading everything ideally this would not be a problem.
> >>>>
> >>>> The problem is more nuanced than that. Parquet's metadata is somewhat
> >>>> "heavy" at the column level. So when you're writing thousands of
> >>>> columns, the fixed overhead associated with reading a single column
> >>>> becomes problematic. There are several data structures associated with
> >>>> decoding a column have a fixed setup and teardown cost. Even if there
> >>>> is 1 millisecond of fixed overhead related to reading a column (I
> >>>> don't know what the number is exactly) then reading 10,000 columns has
> >>>> 10 seconds of unavoidable overhead. It might be useful for us to
> >>>> quantify and communicate the expected overhead when metadata and
> >>>> decoding is taken into account. Simply put having more than 1000
> >>>> columns is not advisable.
> >>>>
> >>>>> Based on an initial cursory look at the Parquet format I guess the
> >>> index and the column meta-data might need to be read in full so I can
> see
> >>> how that might slow down reading only a few columns out of a large
> set. But
> >>> that was not really the case here?
> >>>>>
> >>>>> What would you suggest for looking into the date index slow-down?
> >>>>
> >>>> Can you show a code example to make things easier for us to see what
> >>>> you're seeing?
> >>>>
> >>>>>
> >>>>> Cheers,
> >>>>> Maarten.
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmckinn@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>> hi Maarten,
> >>>>>>
> >>>>>> Are you using the master branch or 0.14.1? There are a number
of
> >>>>>> performance regressions in 0.14.0/0.14.1 that are addressed
in the
> >>>>>> master branch, to appear as 0.15.0 relatively soon.
> >>>>>>
> >>>>>> As a file format, Parquet (and columnar formats in general)
is not
> >>>>>> known to perform well with more than 1000 columns.
> >>>>>>
> >>>>>> On the other items, we'd be happy to work with you to dig through
> the
> >>>>>> performance issues you're seeing.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Wes
> >>>>>>
> >>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
> >>> maartenb@xs4all.nl> wrote:
> >>>>>>>
> >>>>>>> Greetings,
> >>>>>>>
> >>>>>>> We have Pandas DataFrames with typically about 6,000 rows
using
> >>> DateTimeIndex.
> >>>>>>> They have about 20,000 columns with integer column labels,
and data
> >>> with a dtype of float32.
> >>>>>>>
> >>>>>>> We’d like to store these dataframes with parquet, using
the ability
> >>> to read a subset of columns and to store meta-data with the file.
> >>>>>>>
> >>>>>>> We’ve found the reading performance less than expected
compared to
> >>> the published benchmarks (e.g. Wes’ blog post).
> >>>>>>>
> >>>>>>> Using a modified version of his script we did reproduce
his results
> >>> (~ 1GB/s for high entropy, no dict on MacBook pro)
> >>>>>>>
> >>>>>>> But there seem to be three factors that contribute to the
slowdown
> >>> for our datasets:
> >>>>>>>
> >>>>>>> - DateTimeIndex is much slower then a Int index (we see
about a
> >>> factor 5).
> >>>>>>> - The number of columns impact reading speed significantly
(factor
> >>> ~2 going from 16 to 16,000 columns)
> >>>>>>> - The ‘use_pandas_metadata=True’ slows down reading
significantly
> >>> and appears unnecessary? (about 40%)
> >>>>>>>
> >>>>>>> Are there ways we could speedup the reading? Should we use
a
> >>> different layout?
> >>>>>>>
> >>>>>>> Thanks for your help and insights!
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Maarten
> >>>>>>>
> >>>>>>>
> >>>>>>> ps. the routines we used:
> >>>>>>>
> >>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) ->
None:
> >>>>>>>  table = pa.Table.from_pandas(df)
> >>>>>>>  pq.write_table(table, fname, use_dictionary=False,
> >>> compression=None)
> >>>>>>>  return
> >>>>>>>
> >>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>>>>  table = pq.read_table(fname, use_pandas_metadata=False,
> >>> use_threads=True)
> >>>>>>>  df = table.to_pandas()
> >>>>>>>  return df
> >>>
> >>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message