spark-dev mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Spark Data Frame. PreSorded partitions
Date Mon, 04 Dec 2017 09:55:18 GMT
I do not think that the data source API exposes such a thing. You could, however, propose it
for inclusion in Data Source API v2.

However, there are some caveats, because "sorted" can mean two different things (weak vs. strict
order).
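
To make the caveat concrete, here is a minimal sketch of one possible reading of "weak vs. strict" (an assumption; the terms are not defined further in this message): weak means non-decreasing, so ties are allowed, and strict means strictly increasing. The function names and the sample keys are hypothetical and not part of any Spark API.

```python
from typing import Sequence


def is_weakly_sorted(keys: Sequence[int]) -> bool:
    """Non-decreasing order: equal neighbours (ties) are allowed."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


def is_strictly_sorted(keys: Sequence[int]) -> bool:
    """Strictly increasing order: no two neighbours may be equal."""
    return all(a < b for a, b in zip(keys, keys[1:]))


# Hypothetical sort-column values inside one partition; note the tie on 2.
partition_keys = [1, 2, 2, 5]

weak = is_weakly_sorted(partition_keys)      # ties are fine here
strict = is_strictly_sorted(partition_keys)  # the tie on 2 breaks this
```

Under this reading, a source that only guarantees the weak property says nothing about the relative order of rows with equal keys, so any contract would have to state which property it promises.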

Also, is a lot of time really lost to sorting? The best approach is to not read data
that is not needed at all (see min/max indexes in ORC/Parquet or bloom filters in ORC). What
is not read does not need to be sorted. See also predicate pushdown.
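
As a rough illustration of the min/max idea, the sketch below skips whole chunks whose statistics prove that no row can match a filter. It is a plain-Python toy under stated assumptions, not the ORC or Parquet reader: the `Chunk` class, `scan_greater_than`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a stripe/row group with min/max statistics."""
    min_value: int
    max_value: int
    rows: List[int]


def scan_greater_than(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    """Return rows with value > threshold and the number of chunks actually read."""
    rows_out: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # statistics prove no row in this chunk can match; skip it
        chunks_read += 1
        rows_out.extend(v for v in chunk.rows if v > threshold)
    return rows_out, chunks_read


chunks = [
    Chunk(1, 10, [1, 4, 10]),
    Chunk(11, 20, [11, 15, 20]),
    Chunk(21, 30, [21, 30]),
]
rows, chunks_read = scan_greater_than(chunks, 18)
```

Here the first chunk is never read, so its rows never reach a later sort either.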

> On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
> 
> Cross-posting from @user.
> 
> Hello, guys!
> 
> I am working on an implementation of a custom DataSource for the Spark DataFrame API and have a question:
> 
> If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside a partition in my data source.
> 
> Is there a built-in option to tell Spark that the data in each partition is already sorted?
> 
> It seems that Spark could benefit from already sorted partitions,
> for example by using a distributed merge sort algorithm.
> 
> Does that make sense to you?
> 
> 
> 28.11.2017 18:42, Michael Artz wrote:
>> I'm not sure, other than retrieving from a Hive table that is already sorted. This sounds cool though; I would be interested to know this as well.
>> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov.dev@gmail.com> wrote:
>>    Hello, guys!
>>    I am working on an implementation of a custom DataSource for the Spark DataFrame API and have a question:
>>    If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside a partition in my data source.
>>    Is there a built-in option to tell Spark that the data in each partition is already sorted?
>>    It seems that Spark could benefit from already sorted partitions,
>>    for example by using a distributed merge sort algorithm.
>>    Does that make sense to you?
>>    ---------------------------------------------------------------------
>>    To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> 
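
For reference, the merge step that the quoted question alludes to can be sketched in a few lines. This is a single-process illustration of a k-way merge over inputs that are already sorted, using Python's standard `heapq.merge`. It is not Spark code and does not show how Spark plans a sort; `merge_sorted_partitions` and the sample partitions are hypothetical.

```python
import heapq
from typing import Iterable, List


def merge_sorted_partitions(partitions: Iterable[List[int]]) -> List[int]:
    """K-way merge of inputs that are each already sorted in ascending order.

    heapq.merge only compares the current head of each input, so no
    partition is re-sorted; it assumes every input is sorted on entry.
    """
    return list(heapq.merge(*partitions))


# Hypothetical partitions, each sorted by the data source before being returned.
partitions = [[1, 4, 9], [2, 3, 10], [5, 6, 7]]
merged = merge_sorted_partitions(partitions)
```

The merge is only correct if every input really is sorted, which is why the engine would need a trustworthy signal from the data source before it could skip its own sort.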


