spark-dev mailing list archives

From Li Jin <ice.xell...@gmail.com>
Subject Re: Spark Data Frame. PreSorded partitions
Date Mon, 04 Dec 2017 15:38:17 GMT
Sorry, s/ordered distributed/ordered distribution/g

On Mon, Dec 4, 2017 at 10:37 AM, Li Jin <ice.xelloss@gmail.com> wrote:

> Just to give another data point: most of the data we use with Spark is
> sorted on disk; having a way to allow a data source to pass ordered
> distributed to DataFrames would be really useful for us.
>
> On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков <nizhikov.dev@gmail.com>
> wrote:
>
>> Hello, guys.
>>
>> Thank you for answers!
>>
>> > I think pushing down a sort .... could make a big difference.
>> > You could, however, propose that it be included in Data Source API v2.
>>
>> Jörn, are you talking about this jira issue? -
>> https://issues.apache.org/jira/browse/SPARK-15689
>> Is there any additional documentation I should read before making a
>> proposal?
>>
>>
>>
>> 04.12.2017 14:05, Holden Karau wrote:
>>
>>> I think pushing down a sort (or, really, the case where the data
>>> is already naturally returned in sorted order on some column) could make a
>>> big difference. Probably the simplest argument that a lot of time is being
>>> spent sorting (in some use cases) is the fact that sorting is still one of
>>> the standard benchmarks.
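To make the pushdown idea concrete, here is a minimal, hypothetical sketch in plain Python (not Spark's actual planner API; `plan_scan` and its parameters are invented for illustration): when a source reports that its output already satisfies the requested ordering, the engine can elide its own sort step entirely.

```python
def plan_scan(source_ordering, requested_ordering, rows):
    """Return rows in the requested order, sorting only when necessary."""
    if source_ordering == requested_ordering:
        return list(rows)          # sort elided: the source's order is reused
    key, = requested_ordering      # single-column ordering, for simplicity
    return sorted(rows, key=lambda r: r[key])

# Source already delivers rows ordered by "id": no sort is performed.
presorted = [{"id": 1}, {"id": 2}, {"id": 3}]
assert plan_scan(("id",), ("id",), presorted) == presorted

# Source reports no ordering: the engine must sort itself.
assert plan_scan(None, ("id",), [{"id": 3}, {"id": 1}]) == [{"id": 1}, {"id": 3}]
```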
>>>
>>> On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>
>>>     I do not think that the data source API exposes such a thing. You
>>> could, however, propose that it be included in Data Source API v2.
>>>
>>>     However, there are some caveats, because sorted can mean two
>>> different things (weak vs. strict order).
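A small illustration of the two guarantees distinguished above (plain Python; the function names are mine): each partition may be sorted internally (a weak guarantee) without the concatenation of the partitions being globally sorted (the strict guarantee).

```python
def sorted_within_partitions(partitions, key):
    """Weak guarantee: each partition is sorted, but key ranges may overlap."""
    return all(all(key(a) <= key(b) for a, b in zip(p, p[1:]))
               for p in partitions)

def totally_sorted(partitions, key):
    """Strict guarantee: concatenating partitions in order is a global sort."""
    flat = [r for p in partitions for r in p]
    return all(key(a) <= key(b) for a, b in zip(flat, flat[1:]))

parts = [[1, 5, 9], [2, 6, 8]]                           # overlapping ranges
assert sorted_within_partitions(parts, key=lambda x: x)  # weak holds
assert not totally_sorted(parts, key=lambda x: x)        # strict does not
```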
>>>
>>>     Then, is a lot of time really lost because of sorting? The best
>>> thing is to not read data that is not needed at all (see min/max indexes in
>>> ORC/Parquet or bloom filters in ORC). What is not read
>>>     does not need to be sorted. See also predicate pushdown.
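The min/max pruning referred to above can be sketched like this (illustrative Python only, not the ORC/Parquet reader API; `prune_partitions` is an invented name): a partition whose [min, max] key range does not intersect the predicate's range is never read, and so never needs to be sorted either.

```python
def prune_partitions(partitions, lo, hi):
    """Keep only partitions whose [min, max] range can match BETWEEN lo AND hi."""
    kept = []
    for part in partitions:
        pmin, pmax = min(part), max(part)
        if pmax >= lo and pmin <= hi:   # key ranges overlap: must read it
            kept.append(part)
    return kept

parts = [[1, 3, 5], [10, 12, 14], [20, 25, 30]]
# Only the middle partition can contain values between 11 and 19.
assert prune_partitions(parts, lo=11, hi=19) == [[10, 12, 14]]
```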
>>>
>>>      > On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
>>>      >
>>>      > Cross-posting from @user.
>>>      >
>>>      > Hello, guys!
>>>      >
>>>      > I am working on an implementation of a custom DataSource for the
>>> Spark DataFrame API and have a question:
>>>      >
>>>      > If I have a `SELECT * FROM table1 ORDER BY some_column` query, I
>>> can sort the data inside a partition in my data source.
>>>      >
>>>      > Is there a built-in option to tell Spark that the data from each
>>> partition is already sorted?
>>>      >
>>>      > It seems that Spark could benefit from already sorted
>>> partitions, by using a distributed merge sort algorithm, for example.
>>>      >
>>>      > Does it make sense for you?
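The distributed merge idea in the question above can be sketched in a few lines of plain Python (illustrative only, not Spark code): if every partition is already sorted, a k-way heap merge produces a total order in O(n log k) without re-sorting any partition.

```python
import heapq

# Each "partition" arrives already sorted from the data source.
partitions = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

# heapq.merge performs a lazy k-way merge of the sorted inputs,
# yielding globally sorted output without a full re-sort.
merged = list(heapq.merge(*partitions))
assert merged == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```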
>>>      >
>>>      >
>>>      > 28.11.2017 18:42, Michael Artz wrote:
>>>      >> I'm not sure, other than retrieving from a Hive table that is
>>> already sorted. This sounds cool though; I would be interested to know
>>> this as well.
>>>      >> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov.dev@gmail.com> wrote:
>>>      >>    Hello, guys!
>>>      >>    I am working on an implementation of a custom DataSource for
>>> the Spark DataFrame API and have a question:
>>>      >>    If I have a `SELECT * FROM table1 ORDER BY some_column` query,
>>> I can sort the data inside a partition in my data source.
>>>      >>    Is there a built-in option to tell Spark that the data from
>>> each partition is already sorted?
>>>      >>    It seems that Spark could benefit from already sorted
>>> partitions, by using a distributed merge sort algorithm, for example.
>>>      >>    Does it make sense for you?
>>>      >>    ---------------------------------------------------------------------
>>>      >>    To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>      >
>>>      > ---------------------------------------------------------------------
>>>      > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>      >
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>
