spark-dev mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: Spark Data Frame. PreSorded partitions
Date Mon, 04 Dec 2017 11:05:39 GMT
I think pushing down a sort (or, really more so, handling the case where the
data is already naturally returned in sorted order on some column) could make
a big difference. Probably the simplest argument that a lot of time is spent
sorting (in some use cases) is the fact that sorting is still one of the
standard benchmarks.
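
As a rough illustration of why already-sorted partitions help, here is a minimal sketch in plain Python (not Spark, and not any Data Source API). It assumes each partition arrives already sorted on the first field of a hypothetical `(some_column, payload)` row, and uses a k-way merge instead of a full re-sort. The function name `merge_presorted` and the sample data are made up for this example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical row shape: (some_column, payload).
Row = Tuple[int, str]


def merge_presorted(partitions: Iterable[List[Row]]) -> Iterator[Row]:
    """Lazily merge partitions that are each already sorted on row[0].

    A k-way merge only compares the current head of each partition,
    so it never has to re-sort the rows inside a partition.
    """
    return heapq.merge(*partitions, key=lambda row: row[0])


# Three partitions, each already sorted on some_column by the source.
partitions = [
    [(1, "a"), (4, "d"), (9, "i")],
    [(2, "b"), (3, "c"), (8, "h")],
    [(5, "e"), (6, "f"), (7, "g")],
]

merged = list(merge_presorted(partitions))
```

With k partitions and n rows in total, the merge costs roughly O(n log k) comparisons, versus O(n log n) for sorting everything from scratch. Whether Spark could exploit this depends on the source being able to report its ordering, which is the open question in this thread.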

On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfranke@gmail.com> wrote:

> I do not think that the data source API exposes such a thing. You can,
> however, propose that it be included in the Data Source API v2.
>
> However, there are some caveats, because "sorted" can mean two different
> things (weak vs. strict order).
>
> Then, is a lot of time really lost because of sorting? The best thing is
> to not read data that is not needed at all (see min/max indexes in
> ORC/Parquet or bloom filters in ORC). What is not read does not need to be
> sorted. See also predicate pushdown.
>
> > On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
> >
> > Cross-posting from @user.
> >
> > Hello, guys!
> >
> > I am working on an implementation of a custom DataSource for the Spark
> > Data Frame API and have a question:
> >
> > If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort
> > the data inside a partition in my data source.
> >
> > Is there a built-in option to tell Spark that the data from each
> > partition is already sorted?
> >
> > It seems that Spark could benefit from already sorted partitions,
> > for example by using a distributed merge sort algorithm.
> >
> > Does that make sense to you?
> >
> >
> > 28.11.2017 18:42, Michael Artz wrote:
> >> I'm not sure, other than retrieving from a Hive table that is already
> >> sorted. This sounds cool though; I would be interested to know this as well.
> >> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov.dev@gmail.com> wrote:
> >>    Hello, guys!
> >>    I am working on an implementation of a custom DataSource for the
> >>    Spark Data Frame API and have a question:
> >>    If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can
> >>    sort the data inside a partition in my data source.
> >>    Is there a built-in option to tell Spark that the data from each
> >>    partition is already sorted?
> >>    It seems that Spark could benefit from already sorted partitions,
> >>    for example by using a distributed merge sort algorithm.
> >>    Does that make sense to you?
> >>    ---------------------------------------------------------------------
> >>    To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >
>
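
The point in the quoted reply about min/max indexes ("what is not read does not need to be sorted") can be sketched in plain Python. This is only an illustration of the idea, not how ORC or Parquet readers are implemented. The `Chunk` class, the `scan_greater_than` function and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical storage chunk carrying min/max statistics for one column."""
    min_value: int
    max_value: int
    rows: List[int]


def scan_greater_than(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of chunks actually read.

    A chunk whose max_value is <= threshold cannot contain a match,
    so it is skipped without touching its rows.
    """
    result: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # pruned using statistics only
        chunks_read += 1
        result.extend(v for v in chunk.rows if v > threshold)
    return result, chunks_read


chunks = [
    Chunk(1, 10, [1, 5, 10]),
    Chunk(11, 20, [11, 15, 20]),
    Chunk(21, 30, [21, 25, 30]),
]

rows, chunks_read = scan_greater_than(chunks, 18)
```

Here the first chunk is never read, so any later sort only has to handle the rows that survived the filter.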


-- 
Twitter: https://twitter.com/holdenkarau
